Background and Overview

DataCamp offers interactive courses on Python programming. Since R Markdown documents can run simple Python code chunks (though variables created in one chunk are not accessible to later chunks, a large difference from R Markdown for R), this document summarizes notes from the first module.

Topic areas summarized include:

Python Programming

Intro to Python for Data Science

Chapter 1 - Python Basics

Hello Python! - focusing on Python specific to data science:

  • Designed by Guido van Rossum (started as a hobby), but has become a general-purpose language that can build almost anything
  • Python is open-source, free, and has packages for data science
  • This course will focus on Python 3.x, given that support for Python 2.7 has been decreasing (and will continue to decrease)
  • Python scripts are simply text files with a .py extension - must use print() inside scripts in order to force printing

Variables and Types - variable names are case-sensitive in Python:

  • The single equals sign is the assignment operator
  • The type(myVar) call will return the type of the variable - float, integer (“int”), string (“str”), boolean (“bool”), etc.
    • The booleans are written with an initial capital letter - True and False
  • String summation is concatenation without spacing (roughly the same as paste0() in R) – “ab” + “cd” = “abcd” ; note also that “ab” * 2 = “abab”
    • In general, different types of data will respond differently to the same function
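The type-dependent behavior of the operators can be sketched as below. This is only a minimal illustration; the variable names (num_sum, str_sum, and so on) are made up for the example and do not come from the course.

```python
# Inspect the type of a few literal values
print(type(3.14))   # a float
print(type(7))      # an int
print(type("ab"))   # a str
print(type(True))   # a bool

# The same operators respond differently to different types
num_sum = 2 + 3          # numeric addition
str_sum = "ab" + "cd"    # string concatenation, no spacing added
str_rep = "ab" * 2       # string repetition
bool_sum = True + True   # booleans behave like the integers 1 and 0

print(num_sum, str_sum, str_rep, bool_sum)
```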

Example code includes:


# Example, do not modify!
print(5 / 8)

# Put code below here
print(7 + 10)

# Recall that commented lines are marked by the hash-sign, same as R
# Exponentiation is ** and modulo division is %

# Addition and subtraction
print(5 + 5)
print(5 - 5)

# Multiplication and division
print(3 * 5)
print(10 / 2)

# Exponentiation
print(4 ** 2)

# Modulo
print(18 % 7)

# How much is your $100 worth after 7 years?
print(100 * 1.1**7)


# Create a variable savings
savings = 100

# Print out savings
print(savings)


# Create a variable savings
savings = 100

# Create a variable factor
factor = 1.10

# Calculate result
result = savings * factor ** 7

# Print out result
print(result)


# Create a variable desc
desc = "compound interest"

# Create a variable profitable
profitable = True


# Several variables to experiment with
savings = 100
factor = 1.1
desc = "compound interest"

# Assign product of factor and savings to year1
year1 = savings * factor

# Print the type of year1
print(type(year1))

# Assign sum of desc and desc to doubledesc
doubledesc = desc + desc

# Print out doubledesc
print(doubledesc)


# Definition of savings and result
savings = 100
result = 100 * 1.10 ** 7

# Fix the printout
print("I started with $" + str(savings) + " and now have $" + str(result) + ". Awesome!")

# Definition of pi_string
pi_string = "3.1415926"

# Convert pi_string into float: pi_float
pi_float = float(pi_string)
## 0.625
## 17
## 10
## 0
## 15
## 5.0
## 16
## 4
## 194.87171000000012
## 100
## 194.87171000000012
## <class 'float'>
## compound interestcompound interest
## I started with $100 and now have $194.87171000000012. Awesome!

The output all comes at once, another difference from R Markdown for R. In combination with being unable to access any of the variables later in the same document, there are tangible limitations to this approach.

Using Python within R Markdown may be more useful if I install “feather” for both Python and R. Feather allows for running code in Python, then quick-saving pandas in a way that is quick-readable as frames for the next R chunk. See https://blog.rstudio.org/2016/03/29/feather/.

Getting feather for R took just a few seconds using install.packages(). Getting feather for Python 3.6 using Windows seems to require a C++ 14.0 compiler from MS Visual Studio. So far, that is easier said than done.


Chapter 2 - Lists

What are lists? Multiple values in one variable, formed using square brackets such as myList = [a, b, c]:

  • The elements of a list may be of any type, including lists

Subsetting lists - the first element in the list is defined as element 0:

  • Subsetting can be done as myList[myIndex]
  • Alternately, subsetting can be done using negative numbers, with -1 being the last element of the list
  • List slicing can be run using the colon operator
    • myList[a:b] will start with index a and end with index b-1
    • myList[:b] means go from start to index b-1, while myList[a:] means go from a to the end of the list
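The indexing and slicing rules above can be checked with a small sketch. The list letters and the result names are hypothetical, chosen only for illustration.

```python
letters = ["a", "b", "c", "d", "e"]

first = letters[0]     # index 0 is the first element
last = letters[-1]     # index -1 is the last element
middle = letters[1:3]  # starts at index 1, stops before index 3
head = letters[:2]     # from the start up to index 1
tail = letters[3:]     # from index 3 to the end of the list

print(first, last, middle, head, tail)
```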

List manipulation - changing, adding, or removing elements:

  • Changing elements is based on using the indices and the equal sign - myList[myIndex] = myNewValue
  • The addition operator will concatenate the various lists
    • myList + [a, b] will produce a new list containing the elements of myList followed by a and b
  • Deleting elements from a list uses the del statement - for example, del(myList[2]) (equivalently, del myList[2]) will delete the third item of myList, which occupies index 2
  • Behind the scenes, Python is storing the data and the references to the data
    • Importantly, this means that assigning a list to a new name and then editing it through the new name will edit the original list also; both names point to the same underlying data
    • Basically, myNewList = myList copies only the reference to the list, rather than copying the data itself
    • On the other hand, myNewList = myList[:] or myNewList = list(myList) will make a new, independent list (a shallow copy: any nested lists inside it are still shared with the original)
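A short sketch of the reference-versus-copy behavior, under the assumption that plain Python lists are used. The names original, alias, independent, nested, shallow, and deep are illustrative only; copy.deepcopy() from the standard library is added here as one way to copy nested lists and is not part of the course notes.

```python
import copy

original = [1, 2, 3]

# Plain assignment copies the reference, so both names see the change
alias = original
alias[0] = 99

# list() (or original[:]) builds a new list, so the original is untouched
independent = list(original)
independent[1] = -1

# The new list is a shallow copy: nested lists are still shared
nested = [[1, 2], [3, 4]]
shallow = nested[:]
shallow[0][0] = "changed"

# copy.deepcopy() also copies the nested lists
deep = copy.deepcopy(nested)
deep[1][0] = "only in deep"

print(original, independent, nested, deep)
```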

Example code includes:


# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# Create list areas
areas = [hall, kit, liv, bed, bath]

# Print areas
print(areas)


# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# Adapt list areas
areas = ["hallway", hall, "kitchen", kit, "living room", liv, "bedroom", bed, "bathroom", bath]

# Print areas
print(areas)


# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# house information as list of lists
house = [["hallway", hall],
         ["kitchen", kit],
         ["living room", liv],
         ["bedroom", bed], 
         ["bathroom", bath]
         ]

# Print out house
print(house)

# Print out the type of house
print(type(house))


# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Print out second element from areas
print(areas[1])

# Print out last element from areas
print(areas[-1])

# Print out the area of the living room
print(areas[5])


# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Sum of kitchen and bedroom area: eat_sleep_area
eat_sleep_area = areas[3] + areas[7]

# Print the variable eat_sleep_area
print(eat_sleep_area)


# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Use slicing to create downstairs
downstairs = areas[:6]

# Use slicing to create upstairs
upstairs = areas[6:]

# Print out downstairs and upstairs
print(downstairs)
print(upstairs)


# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Correct the bathroom area
areas[-1] = 10.5

# Change "living room" to "chill zone"
areas[4] = "chill zone"


# Create the areas list and make some changes
areas = ["hallway", 11.25, "kitchen", 18.0, "chill zone", 20.0,
         "bedroom", 10.75, "bathroom", 10.50]

# Add poolhouse data to areas, new list is areas_1
areas_1 = areas + ["poolhouse", 24.5]

# Add garage data to areas_1, new list is areas_2
areas_2 = areas_1 + ["garage", 15.45]


# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Create areas_copy
areas_copy = list(areas)

# Change areas_copy
areas_copy[0] = 5.0

# Print areas
print(areas)
## [11.25, 18.0, 20.0, 10.75, 9.5]
## ['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0, 'bedroom', 10.75, 'bathroom', 9.5]
## [['hallway', 11.25], ['kitchen', 18.0], ['living room', 20.0], ['bedroom', 10.75], ['bathroom', 9.5]]
## <class 'list'>
## 11.25
## 9.5
## 20.0
## 28.75
## ['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0]
## ['bedroom', 10.75, 'bathroom', 9.5]
## [11.25, 18.0, 20.0, 10.75, 9.5]

Chapter 3 - Functions and Packages

Introduction to functions - pieces of reusable code for solving a particular task:

  • Built-in functions are things like max() or type() or round(myNum, myDecimals)
  • Can use help(builtInFunction) to get the help page for builtInFunction
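As a minimal sketch of calling built-in functions (the list values and the result names are hypothetical):

```python
values = [11.25, 18.0, 20.0]

largest = max(values)         # largest element of the list
n_values = len(values)        # number of elements
rounded = round(1.6789, 2)    # round to two decimals
rounded_int = round(1.6789)   # with no second argument, rounds to an integer

print(largest, n_values, rounded, rounded_int)

# help(round) would print the documentation page for round()
```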

Methods - all objects of a specific type have default access to the methods for that object:

  • Methods are functions that belong to an object
  • For example, myList.index(“mySearch”) will return the index of the first element equal to “mySearch” (if searching for a number, it should not be quoted)
    • Alternately, myList.count(“mySearch”) will return the number of matches to “mySearch”
  • The methods will behave differently (perhaps even not existing) for different object types
  • Further, some methods modify the object that they are associated with; for example .append()
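The distinction between methods that return a value and methods that modify their object can be sketched as follows. The list rooms and the other names are made up for this illustration.

```python
rooms = ["hall", "kitchen", "hall", "bedroom"]

first_hall = rooms.index("hall")   # index of the first match
hall_count = rooms.count("hall")   # number of matches

# append() changes the list in place and returns None
append_result = rooms.append("garage")

# Strings cannot be changed in place, so upper() returns a new string
room = "poolhouse"
room_up = room.upper()

print(first_hall, hall_count, append_result, rooms, room, room_up)
```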

Packages are directories of Python scripts, each a module specifying functions, methods, and types:

  • Thousands of Python packages are available, including Numpy, Matplotlib, and Scikit-learn
  • Installing packages is based on the “pip” system - download get-pip.py from http://pip.readthedocs.org/en/stable/installing
    • Then, use “pip3 install myPackage” (unquoted) at the command line
    • On my machine, this needs to be run at the command line as [PythonPath].exe -m pip install myPackage
  • Packages can then be imported using “import myPackage” (unquoted) inside a Python script or session
  • After a plain import, each function needs to be prefixed with its package name, for example numpy.array() rather than just array()
    • As a result, it is often helpful to use import numpy as np, so that np.array() can serve as a shortcut for numpy.array()
  • Alternately, can ask for “from numpy import array” if only wanting to import the function array()
    • Now, array() can also be called without any prefix; for example, as array(myNumbers) rather than numpy.array(myNumbers)
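The three import styles can be compared side by side. This sketch uses the standard-library math module rather than numpy so that it runs without any installation; the alias m and the result names are arbitrary choices for the example.

```python
# Style 1: plain import, functions need the package prefix
import math
root_a = math.sqrt(16)

# Style 2: import under a shorter alias
import math as m
root_b = m.sqrt(16)

# Style 3: import a single function, callable without any prefix
from math import sqrt
root_c = sqrt(16)

print(root_a, root_b, root_c)
```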

Example code includes:


# Create variables var1 and var2
var1 = [1, 2, 3, 4]
var2 = True

# Print out type of var1
print(type(var1))

# Print out length of var1
print(len(var1))

# Convert var2 to an integer: out2
out2 = int(var2)


# Create lists first and second
first = [11.25, 18.0, 20.0]
second = [10.75, 9.50]

# Paste together first and second: full
full = first + second

# Sort full in descending order: full_sorted
full_sorted = sorted(full, reverse=True)

# Print out full_sorted
print(full_sorted)


# string to experiment with: room
room = "poolhouse"

# Use upper() on room: room_up
room_up = room.upper()

# Print out room and room_up
print(room)
print(room_up)

# Print out the number of o's in room
print(room.count("o"))


# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Print out the index of the element 20.0
print(areas.index(20.0))

# Print out how often 14.5 appears in areas
print(areas.count(14.5))


# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Use append twice to add poolhouse and garage size
areas.append(24.5)
areas.append(15.45)

# Print out areas
print(areas)

# Reverse the orders of the elements in areas
areas.reverse()

# Print out areas
print(areas)


# Definition of radius
r = 0.43

# Import the math package
import math

# Calculate C
C = 2 * math.pi * r

# Calculate A
A = math.pi * (r ** 2)

# Build printout
print("Circumference: " + str(C))
print("Area: " + str(A))


# Definition of radius
r = 192500

# Import radians function of math package
from math import radians

# Travel distance of Moon over 12 degrees. Store in dist.
dist = r * radians(12)

# Print out dist
print(dist)
## <class 'list'>
## 4
## [20.0, 18.0, 11.25, 10.75, 9.5]
## poolhouse
## POOLHOUSE
## 3
## 2
## 0
## [11.25, 18.0, 20.0, 10.75, 9.5, 24.5, 15.45]
## [15.45, 24.5, 9.5, 10.75, 20.0, 18.0, 11.25]
## Circumference: 2.701769682087222
## Area: 0.5808804816487527
## 40317.10572106901

Chapter 4 - Numpy

Numpy extends list operations using “Numerical Python” (collections of values, optimized for speed):

  • The Numpy array is like a list, but you can run mathematical calculations with it
    • For example, [1, 2, 3] * 2 is [1, 2, 3, 1, 2, 3] while [1, 2, 3] **2 throws an error
    • However, numpy.array([1, 2, 3]) * 2 is array([2, 4, 6]) while numpy.array([1, 2, 3]) ** 2 is array([1, 4, 9]), both as expected
  • The basic structure of numpy.array() is a vector, which will operate element-wise
    • Numpy arrays must be of a single-type, converted to the “most flexible” (e.g., string is more flexible than float is more flexible than boolean)
  • The plus sign with a numpy.array() will add element-wise rather than pasting (as it would with lists)
  • Can also use logical subsetting; for example, bmi[bmi > 23] will return all bmi that are greater than 23
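A minimal sketch contrasting lists with Numpy arrays, assuming numpy is installed. The values and variable names are illustrative only.

```python
import numpy as np

values = [1, 2, 3]

# Lists: * repeats and + pastes together
list_times_two = values * 2
list_sum = values + values

# Numpy arrays: the same operators work element-wise
arr = np.array(values)
arr_times_two = arr * 2
arr_squared = arr ** 2
arr_sum = arr + arr

# Mixed input is converted to a single type (here the boolean becomes a float)
mixed = np.array([1.5, True])

# Logical subsetting keeps only the elements where the condition is True
bmi = np.array([21.8, 20.9, 24.7, 23.4])
high_bmi = bmi[bmi > 23]

print(list_times_two, arr_times_two, arr_squared, arr_sum, mixed, high_bmi)
```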

2D Numpy Arrays - extending the vector to be multi-dimensional:

  • For a numpy vector/array, the type will be numpy.ndarray (stands for n-dimensional array)
  • Can create a two-dimensional array much like an array of lists; numpy.array( [ [1, 2, 3], [4, 5, 6] ] )
    • The .shape attribute (no parentheses) will give the dimensions of the array as rows, columns
  • Selecting a row is just based on myArray[rowIndex], so a specific cell can be extracted with myArray[rowIndex][colIndex]
    • Alternately, myArray[rowIndex, colIndex] will also return the specified row and column
    • Can also use myArray[:, colIndex] to get just the specified column(s)
  • The 2D Numpy arrays can also be used for element-wise operations
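The 2D indexing options can be sketched on a small array, assuming numpy is installed. The array grid and the result names are hypothetical.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

dims = grid.shape      # (rows, columns); shape is an attribute, not a call
row = grid[1]          # the second row
cell_a = grid[1][2]    # row 1, then element 2 of that row
cell_b = grid[1, 2]    # same cell using the [row, column] form
column = grid[:, 0]    # all rows, first column
doubled = grid * 2     # element-wise operation on the full 2D array

print(dims, row, cell_a, cell_b, column)
print(doubled)
```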

Numpy Basic Statistics - basic data exploration:

  • numpy.mean() will take the mean of the relevant data
  • numpy.median() will take the median of the relevant data
  • numpy.corrcoef() will create the correlation coefficients
  • numpy.std() will take the standard deviation
  • numpy.sum() and numpy.sort() are faster than the base versions since numpy has enforced common data types within the array
  • Note that Filip manufactured the MLB data as follows
    • height = numpy.round(numpy.random.normal(1.75, 0.20, 5000), 2)
    • weight = numpy.round(numpy.random.normal(60.32, 15, 5000), 2)
    • np_baseball = numpy.column_stack((height, weight))
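The summary functions above can be tried on simulated data built the same way as in the notes. The call to np.random.seed(0) is an addition made here only so the numbers are repeatable; the name data is hypothetical, and the values are random draws rather than real MLB data.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed, only for repeatability

# Simulated height and weight columns, as described in the notes
height = np.round(np.random.normal(1.75, 0.20, 5000), 2)
weight = np.round(np.random.normal(60.32, 15, 5000), 2)
data = np.column_stack((height, weight))

mean_height = np.mean(data[:, 0])
median_height = np.median(data[:, 0])
std_height = np.std(data[:, 0])

# corrcoef returns a 2 x 2 matrix; the off-diagonal entry is the correlation
corr = np.corrcoef(data[:, 0], data[:, 1])

print(data.shape)
print(mean_height, median_height, std_height)
print(corr)
```

Since the two columns are drawn independently, the off-diagonal correlation should come out close to zero.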

Example code includes:


# Create list baseball
baseball = [180, 215, 210, 210, 188, 176, 209, 200]

# Import the numpy package as np
import numpy as np

# Create a Numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out type of np_baseball
print(type(np_baseball))


# DO NOT HAVE THE HEIGHT OR WEIGHT DATA - it is MLB data on 1000 players
# Create dummy data
height = np.round(np.random.normal(1.75, 0.20, 5000), 2)  
weight = np.round(np.random.normal(60.32, 15, 5000), 2)  


# Create a Numpy array from height: np_height
np_height = np.array(height)

# Print out np_height
print(np_height)

# Convert np_height to m: np_height_m
np_height_m = np_height * 0.0254

# Print np_height_m
print(np_height_m)


# Create array from height with correct units: np_height_m
np_height_m = np.array(height) * 0.0254

# Create array from weight with correct units: np_weight_kg
np_weight_kg = np.array(weight) * 0.453592

# Calculate the BMI: bmi
bmi = np_weight_kg / (np_height_m ** 2)

# Print out bmi
print(bmi)


# Calculate the BMI: bmi
np_height_m = np.array(height) * 0.0254
np_weight_kg = np.array(weight) * 0.453592
bmi = np_weight_kg / np_height_m ** 2

# Create the light array
light = bmi < 21

# Print out light
print(light)

# Print out BMIs of all baseball players whose BMI is below 21
print(bmi[light])


# Store weight and height lists as numpy arrays
np_weight = np.array(weight)
np_height = np.array(height)

# Print out the weight at index 50
print(np_weight[50])

# Print out sub-array of np_height: index 100 up to and including index 110
print(np_height[100:111])


# Create baseball, a list of lists
baseball = [[180, 78.4],
            [215, 102.7],
            [210, 98.5],
            [188, 75.2]]

# Import numpy
import numpy as np

# Create a 2D Numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out the type of np_baseball
print(type(np_baseball))

# Print out the shape of np_baseball
print(np_baseball.shape)


# DO NOT HAVE baseball, which is a list of lists of the 1015 MLB players with their height/weight
# Create a 2D Numpy array from baseball: np_baseball
# np_baseball = np.array(baseball)
# Dummy up the data instead
np_baseball = np.column_stack((height, weight))  

# Print out the shape of np_baseball
print(np_baseball.shape)  # 1015 x 2 with the real data; (5000, 2) with the dummy data


# Create np_baseball (2 cols)
# np_baseball = np.array(baseball)

# Print out the 50th row of np_baseball
print(np_baseball[49])

# Select the entire second column of np_baseball: np_weight
np_weight = np_baseball[:, 1]

# Print out height of 124th player
print(np_baseball[123, 0])


# DO NOT HAVE baseball OR updated ; each should be 1,015 x 3 (height, weight, bmi)
# Create np_baseball (3 cols)
# np_baseball = np.array(baseball)

# Print out addition of np_baseball and updated
# print(np_baseball + updated)

# Create Numpy array: conversion
# conversion = np.array([0.0254, 0.453592, 1])

# Print out product of np_baseball and conversion
# print(np_baseball * conversion)


# Create np_height from np_baseball
np_height = np_baseball[:, 0]

# Print out the mean of np_height
print(np.mean(np_height))

# Print out the median of np_height
print(np.median(np_height))


# Print mean height (first column)
avg = np.mean(np_baseball[:,0])
print("Average: " + str(avg))

# Print median height. Replace 'None'
med = np.median(np_baseball[:,0])
print("Median: " + str(med))

# Print out the standard deviation on height. Replace 'None'
stddev = np.std(np_baseball[:,0])
print("Standard Deviation: " + str(stddev))

# Print out correlation between first and second column. Replace 'None'
corr = np.corrcoef(np_baseball[:, 0], np_baseball[:, 1])
print("Correlation: " + str(corr))


# DO NOT HAVE DATA for positions or heights (soccer data . . . )
# Convert positions and heights to numpy arrays: np_positions, np_heights
# np_positions = np.array(positions)
# np_heights = np.array(heights)

# Heights of the goalkeepers: gk_heights
# gk_heights = np_heights[np_positions == "GK"]

# Heights of the other players: other_heights
# other_heights = np_heights[np_positions != "GK"]

# Print out the median height of goalkeepers. Replace 'None'
# print("Median height of goalkeepers: " + str(np.median(gk_heights)))

# Print out the median height of other players. Replace 'None'
# print("Median height of other players: " + str(np.median(other_heights)))
## <class 'numpy.ndarray'>
## [ 1.57  1.46  1.77 ...,  1.57  1.7   1.43]
## [ 0.039878  0.037084  0.044958 ...,  0.039878  0.04318   0.036322]
## [ 13999.19948933  21910.71217759  10040.31642715 ...,  19666.76456376
##   14460.35353801  17902.4906595 ]
## [False False False ..., False False False]
## []
## 45.83
## [ 1.65  1.47  1.46  1.63  1.79  1.55  1.68  1.59  1.78  2.19  1.54]
## <class 'numpy.ndarray'>
## (4, 2)
## (5000, 2)
## [  1.43  63.22]
## 1.39
## 1.747684
## 1.75
## Average: 1.747684
## Median: 1.75
## Standard Deviation: 0.201613482049
## Correlation: [[ 1.          0.00115235]
##  [ 0.00115235  1.        ]]

Intermediate Python for Data Science

Chapter 1 - Matplotlib for Data Visualization

Basic plots with matplotlib - generally, the heart of visualization within Python:

  • Need to import the key functions; for example import matplotlib.pyplot as plt
  • Then, plt.plot(list1, list2) will create a line plot with list1 being x and list2 being y
    • If you want to actually see the plot, use plt.show(), somewhat like plt.plot() just being a saved ggplot2 object
  • Alternately, plt.scatter() to create a scatter plot

Histograms are useful for exploring a dataset (getting an idea about the distribution):

  • import matplotlib.pyplot as plt # help(plt.hist) will show all the options for a histogram
  • plt.hist(x, bins=myBins) # default for myBins is 10
    • Needs plt.show() as per the above

Customization for changing the base plot types in Python:

  • Can label x-axis with plt.xlabel(‘X Label’)
  • Can label y-axis with plt.ylabel(‘Y Label’)
  • Can add title with plt.title(‘My Title’)
  • Can add plt.yticks([myList], [myNames]) # myList can be 2+ elements which will define the y-range; optional list myNames must be the same length as myList and will label the y-axis
    • All of these must be run PRIOR to the plt.show() command
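A minimal sketch of the customization calls, assuming matplotlib is installed. The data, the tick labels, and the choice of the non-interactive "Agg" backend (so that no plot window is needed) are assumptions made for this example, not part of the course.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

plt.plot(x, y)

# Customizations are applied before the plot is shown or saved
plt.xlabel("X Label")
plt.ylabel("Y Label")
plt.title("My Title")
plt.yticks([0, 10, 20, 30], ["zero", "ten", "twenty", "thirty"])

# Read the settings back from the current axes to confirm they were applied
ax = plt.gca()
labels = (ax.get_xlabel(), ax.get_ylabel(), ax.get_title())
print(labels)

plt.clf()  # clear the figure to prevent over-plotting later
```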

Example code includes:


# Define the reading data path
readPath = "C:/Users/Dave/Documents/Personal/Learning/Coursera/RDirectory/RHomework/DataCamp/"

# This is world population 1950-2100 (DO NOT HAVE FILE)
# Import some wikipedia data from CSV as panda
import pandas as pd

globalPop = pd.read_csv(readPath + "GlobalPopYear_1950_2100_v001.csv")

year = globalPop["year"]
pop = globalPop["pop"]

# Print the last item from year and pop
print(year.iloc[-1])
print(pop.iloc[-1])

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Make a line plot: year on the x-axis, pop on the y-axis
plt.plot(year, pop)

# Display the plot with plt.show()
# Need to use a proper Python IDE for plt.show() - otherwise just pops up the images "live"
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy001.png", bbox_inches="tight")
## 2100
## 11000000002

The population plot saved from Python is:

Next, the Hans Rosling Data is explored:


# Using the Hans Rosling Data (2007 life expectancy and GDP for 142 countries)
# Create from Wikipedia, World Bank, and the like
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# readPath = "C:\\Users\\Dave\\Documents\\Personal\\Learning\\Coursera\\RDirectory\\RHomework\\DataCamp\\"
readPath = "C:/Users/Dave/Documents/Personal/Learning/Coursera/RDirectory/RHomework/DataCamp/"


globalData = pd.read_csv(readPath + "GlobalGDPLifeExpectancy_v001.csv")

gdp_cap = 1000000 * np.array(globalData["gdp"]) / np.array(globalData["pop"])
life_exp = globalData["le_2015"]
pop = globalData["pop"]
life_exp1950 = globalData["le_1960"]  # Much easier to get 1960 than 1950 online - KLUGE
regn = globalData["region"]

# Print the last item of gdp_cap and life_exp
print(gdp_cap[-1])  # Since it is a numpy
print(life_exp.iloc[-1])  # Since it is a panda

# Make a line plot, gdp_cap on the x-axis, life_exp on the y-axis
plt.plot(gdp_cap, life_exp)

# Display the plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy002.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Change the line plot below to a scatter plot
plt.scatter(gdp_cap, life_exp)

# Put the x-axis on a logarithmic scale
plt.xscale('log')

# Show plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy003.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Brings in yet another variable, population

# Build Scatter plot
plt.scatter(pop, life_exp)
plt.xscale("log")

# Show plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy004.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Create histogram of life_exp data
plt.hist(life_exp)

# Display histogram
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy005.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Build histogram with 5 bins
plt.hist(life_exp, bins=5)

# Show and clean up plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# plt.clf()
# Save as dummy PNG instead
plt.savefig("_dummyPy006.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Build histogram with 20 bins
plt.hist(life_exp, bins=20)

# Show and clean up again
# Need to use a proper Python IDE for plt.show()
# plt.show()
# plt.clf()
# Save as dummy PNG instead
plt.savefig("_dummyPy007.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Histogram of life_exp, 15 bins
plt.hist(life_exp, bins=15)

# Show and clear plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# plt.clf()
# Save as dummy PNG instead
plt.savefig("_dummyPy008.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Histogram of life_exp1950, 15 bins
plt.hist(life_exp1950, bins=15)

# Show and clear plot again
# Need to use a proper Python IDE for plt.show()
# plt.show()
# plt.clf()
# Save as dummy PNG instead
plt.savefig("_dummyPy009.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Basic scatter plot, log scale
plt.scatter(gdp_cap, life_exp)
plt.xscale('log') 

# Strings
xlab = 'GDP per Capita [in USD]'
ylab = 'Life Expectancy [in years]'
title = 'World Development in 2007'

# Add axis labels
plt.xlabel(xlab)
plt.ylabel(ylab)

# Add title
plt.title(title)

# After customizing, display the plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy010.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Scatter plot
plt.scatter(gdp_cap, life_exp)

# Previous customizations
plt.xscale('log') 
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')

# Definition of tick_val and tick_lab
tick_val = [1000,10000,100000]
tick_lab = ['1k','10k','100k']

# Adapt the ticks on the x-axis
plt.xticks(tick_val, tick_lab)

# After customizing, display the plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy011.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Import numpy as np
import numpy as np

# Store pop as a numpy array: np_pop
np_pop = np.array(pop) / 1000000  # Population in millions

# Double np_pop
np_pop = np_pop * 2 # Doubled for larger bubbles

# Update: set s argument to np_pop
plt.scatter(gdp_cap, life_exp, s = np_pop)

# Previous customizations
plt.xscale('log') 
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000, 10000, 100000],['1k', '10k', '100k'])

# Display the plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy012.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Color is based on continent, using the below dictionary
colDict = {
    'Asia':'red',
    'Europe':'green',
    'Africa':'blue',
    'Americas':'yellow',
    'Oceania':'black'
}

col=[]

for eachRegion in regn :
    col.append(colDict[eachRegion])

# Specify c and alpha inside plt.scatter()
plt.scatter(x = gdp_cap, y = life_exp, s = np_pop , c=col, alpha=0.8)

# Previous customizations
plt.xscale('log') 
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])

# Show the plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# plt.clf()
# Save as dummy PNG instead
plt.savefig("_dummyPy013.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Scatter plot
plt.scatter(x = gdp_cap, y = life_exp, s = np_pop, c = col, alpha = 0.4)

# Previous customizations
plt.xscale('log') 
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])

# Additional customizations
plt.text(1550, 71, 'India')
plt.text(5700, 80, 'China')

# Add grid() call
plt.grid(True)

# Show the plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy014.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting
## 888.906425266
## 59.2

GDP vs Life Expectancy by Country as Line Graph (not good . . . ):

GDP vs Life Expectancy by Country as Scatter Plot:

GDP vs Life Expectancy by Country as Scatter Plot with Log Scale:

Life Expectancy Histogram (default 10 bins):

Life Expectancy Histogram (5 bins):

Life Expectancy Histogram (20 bins):

Life Expectancy Histogram for 2015 (15 bins):

Life Expectancy Histogram for 1960 (15 bins):

Base Rosling-like graph (GDP vs Life Expectancy by Country Scatter):

Rosling-like graph (enhanced tick labels):

Rosling-like graph (bubble size ~ population):

Rosling-like graph (bubble color based on region):

Rosling-like graph (semi-transparent bubbles):


Chapter 2 - Dictionaries and Pandas

Dictionaries, Part I - key-value pairs:

  • The dictionary is created with curly brackets, with key-value pairs denoted by a colon and separated by a comma
    • world = { “afghanistan”:31, “albania”:2.8, “algeria”:39 } # sets up three key-value pairs as the dictionary called world
    • Now, world[“albania”] will return 2.8, the value that is associated with key “albania”
  • Dictionary look-ups are extremely fast even for enormous dictionaries

Dictionaries, Part II:

  • Dictionaries need to have unique keys; if duplicate keys are included, the value associated with the LAST key is retained
  • The keys also need to be immutable objects, such as strings, booleans, integers, or floats (but not lists, since you can change their contents dynamically)
  • Assigning (or changing) key-value pairs in a dictionary is myDict[myKey] = myValue
  • To test whether a key is in the dictionary, use myKey in myDict # returns boolean True or False
  • To delete an item from the dictionary, use del(myDict[myKey]) # the full key-value pair is removed
  • Lists and dictionaries have many similarities, but also some key differences
    • Lists are indexed by a range of numbers, making them ideal for collections of values where the order matters
    • Dictionaries are indexed by unique keys, making them ideal for fast look-ups (before Python 3.7 they are also inherently unordered based on how they are hashed; from Python 3.7 onward they preserve insertion order)
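The dictionary rules above can be sketched as follows. The world dictionary reuses the values from the notes with a deliberately duplicated key; the key "atlantis" and the other names are hypothetical.

```python
# A duplicate key in the literal: the value given LAST is retained
world = {"afghanistan": 31, "albania": 2.8, "algeria": 39, "albania": 2.9}
albania = world["albania"]

# Add a key-value pair, test for the key, then delete it again
world["atlantis"] = 0.0           # hypothetical entry, for illustration only
had_atlantis = "atlantis" in world
del world["atlantis"]
has_atlantis = "atlantis" in world

# Lists are mutable, so they cannot be used as keys
try:
    bad = {["a", "list"]: 1}
    key_problem = None
except TypeError as err:
    key_problem = str(err)

print(albania, len(world), had_atlantis, has_atlantis, key_problem)
```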

Pandas, Part I - tabular dataset storage and manipulation:

  • Same general philosophy where rows are observations and columns are attributes/variables
  • Basically, need a form of numpy.array() that allows for different variable types in different columns
  • The pandas package provides a high-level data-manipulation tool (built on numpy by Wes McKinney)
    • The pandas package conveniently stores data as a DataFrame
    • Generally, the rows and columns will all have unique names
    • Further, the columns can all be of different types
  • Suppose that you create a dictionary where the keys are the desired column labels while the values are a list of the desired values for the column
    • import pandas as pd
    • myFrame = pd.DataFrame(myDict)
    • myFrame.index = labelList # optional, if wanting to provide row-names
  • Alternately, the data can be imported such as from a CSV
    • pd.read_csv(myCSVPath, index_col=myIndex) # index_col is optional and needed only if an index column has been provided
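Both ways of building a DataFrame can be sketched as below, assuming pandas is installed. The dictionary contents, the row labels, and the in-memory CSV text (read through io.StringIO instead of a file path) are made up for this example.

```python
import io
import pandas as pd

# Keys become column labels; each value is the list of entries for that column
my_dict = {
    "country": ["spain", "france", "germany", "norway"],
    "capital": ["madrid", "paris", "berlin", "oslo"],
    "capital_len": [6, 5, 6, 4],   # hypothetical numeric column
}
my_frame = pd.DataFrame(my_dict)
my_frame.index = ["ES", "FR", "DE", "NO"]   # optional row names
print(my_frame)

# Reading CSV data, with the first column used as the index
csv_text = "code,country,capital\nES,spain,madrid\nFR,france,paris\n"
from_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(from_csv)
```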

Pandas, Part II - indexing and selecting data from a DataFrame using square brackets, loc, and iloc:

  • myFrame[colNameQuoted] will return a subset of the panda with type pandas.core.series.Series
  • myFrame[[colNameQuoted]] will return a single-column panda with type pandas.core.frame.DataFrame
  • myFrame[[colName1Quoted, colName2Quoted]] will return a two-column panda
  • myFrame[a:b] will return rows rather than columns, starting with index a and ending at index b-1
  • The loc and iloc tools are designed to extend Pandas data extraction to be more similar to numpy extractions such as [ rows, columns ]
    • myFrame.loc[rowNameQuoted] will return a panda series matching the ROW
    • myFrame.loc[[rowNameQuoted]] will return a panda frame containing just that ROW
    • myFrame.loc[[rowName1Quoted, rowName2Quoted, rowName3Quoted]] will return a panda frame containing the requested ROWS
    • myFrame.loc[[rowListQuoted], [colListQuoted]] will return just the specified rows and columns
    • myFrame.loc[:, [colListQuoted]] will return all rows and just the specified columns
  • The iloc indexer is the integer-position-based counterpart of loc for data access and extraction
    • myFrame.iloc[[rowIndices]] will return a panda frame containing just these ROWS
    • myFrame.iloc[[rowIndices], [colIndices]] will return a panda frame containing just these ROWS and COLUMNS
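
To make the label-versus-position distinction concrete, here is a small sketch on a made-up three-row frame (the data is illustrative only). It shows that a single label gives a Series, a list of labels gives a DataFrame, and that the same cells can be reached through either loc or iloc.

```python
import pandas as pd

# Hypothetical frame with row labels, loosely mirroring the cars example
cars = pd.DataFrame(
    {"country": ["United States", "Japan", "Egypt"],
     "cars_per_cap": [809, 588, 45]},
    index=["US", "JAP", "EG"],
)

# Label-based: a single label gives a Series, a list of labels a DataFrame
row_series = cars.loc["JAP"]
row_frame = cars.loc[["JAP"]]

# Position-based: rows 0 and 2, column 1
by_position = cars.iloc[[0, 2], [1]]

# The same selection expressed with labels
by_label = cars.loc[["US", "EG"], ["cars_per_cap"]]

print(type(row_series).__name__, type(row_frame).__name__)
print(by_position.equals(by_label))
```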

Example code includes:


# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# Get index of 'germany': ind_ger
ind_ger = countries.index("germany")

# Use ind_ger to print out capital of Germany
print(capitals[ind_ger])


# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# From string in countries and capitals, create dictionary europe
europe = {
   'spain':'madrid', 
   'france':'paris', 
   'germany':'berlin', 
   'norway':'oslo'
}

# Print europe
print(europe)


# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }

# Print out the keys in europe
print(europe.keys())

# Print out value that belongs to key 'norway'
print(europe['norway'])


# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }

# Add italy to europe
europe['italy'] = 'rome'

# Print out italy in europe
print('italy' in europe)

# Add poland to europe
europe['poland'] = 'warsaw'

# Print europe
print(europe)


# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw',
          'australia':'vienna' }

# Update capital of germany
europe['germany'] = 'berlin'

# Remove australia
del(europe['australia'])

# Print europe
print(europe)


# Dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }


# Print out the capital of France
print(europe['france']['capital'])

# Create sub-dictionary data
data = { 'capital':'rome', 'population':59.83 }

# Add data to europe under key 'italy'
europe['italy'] = data

# Print europe
print(europe)


# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Import pandas as pd
import pandas as pd

# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = { 'country': names, 'drives_right': dr, 'cars_per_cap': cpc }

# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)

# Print cars
print(cars)


# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
my_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }  # avoid the name dict, which would shadow the built-in
cars = pd.DataFrame(my_dict)
print(cars)

# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels

# Print cars again
print(cars)


# DO NOT HAVE FILE "cars.csv" - cars_per_cap , country , drives_right
# Created as cars.to_csv("cars.csv")
# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv")

# Print out cars
print(cars)


# SLIGHTLY DIFFERENT VERSION WITH ROW NAMES AS THE FIRST COLUMN
# Import pandas as pd
import pandas as pd

# Fix import by including index_col
cars = pd.read_csv('cars.csv', index_col=0)

# Print out cars
print(cars)


# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out country column as Pandas Series
print(cars["country"])

# Print out country column as Pandas DataFrame
print(cars[["country"]])

# Print out DataFrame with country and drives_right columns
print(cars[["country", "drives_right"]])


# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out first 3 observations
print(cars[0:3])

# Print out fourth, fifth and sixth observation
print(cars[3:6])


# Print out observation for Japan
print(cars.loc["JAP"])

# Print out observations for Australia and Egypt
print(cars.loc[["AUS", "EG"]])


# Print out drives_right value of Morocco
print(cars.loc[["MOR"], ["drives_right"]])

# Print sub-DataFrame
print(cars.loc[["RU", "MOR"], ["country", "drives_right"]])


# Print out drives_right column as Series
print(cars.loc[:, "drives_right"])

# Print out drives_right column as DataFrame
print(cars.loc[:, ["drives_right"]])

# Print out cars_per_cap and drives_right as DataFrame
print(cars.loc[:, ["cars_per_cap", "drives_right"]])
## berlin
## {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo'}
## dict_keys(['spain', 'france', 'germany', 'norway'])
## oslo
## True
## {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw'}
## {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw'}
## paris
## {'spain': {'capital': 'madrid', 'population': 46.77}, 'france': {'capital': 'paris', 'population': 66.03}, 'germany': {'capital': 'berlin', 'population': 80.62}, 'norway': {'capital': 'oslo', 'population': 5.084}, 'italy': {'capital': 'rome', 'population': 59.83}}
##    cars_per_cap        country  drives_right
## 0           809  United States          True
## 1           731      Australia         False
## 2           588          Japan         False
## 3            18          India         False
## 4           200         Russia          True
## 5            70        Morocco          True
## 6            45          Egypt          True
##    cars_per_cap        country  drives_right
## 0           809  United States          True
## 1           731      Australia         False
## 2           588          Japan         False
## 3            18          India         False
## 4           200         Russia          True
## 5            70        Morocco          True
## 6            45          Egypt          True
##      cars_per_cap        country  drives_right
## US            809  United States          True
## AUS           731      Australia         False
## JAP           588          Japan         False
## IN             18          India         False
## RU            200         Russia          True
## MOR            70        Morocco          True
## EG             45          Egypt          True
##   Unnamed: 0  cars_per_cap        country  drives_right
## 0         US           809  United States          True
## 1        AUS           731      Australia         False
## 2        JAP           588          Japan         False
## 3         IN            18          India         False
## 4         RU           200         Russia          True
## 5        MOR            70        Morocco          True
## 6         EG            45          Egypt          True
##      cars_per_cap        country  drives_right
## US            809  United States          True
## AUS           731      Australia         False
## JAP           588          Japan         False
## IN             18          India         False
## RU            200         Russia          True
## MOR            70        Morocco          True
## EG             45          Egypt          True
## US     United States
## AUS        Australia
## JAP            Japan
## IN             India
## RU            Russia
## MOR          Morocco
## EG             Egypt
## Name: country, dtype: object
##            country
## US   United States
## AUS      Australia
## JAP          Japan
## IN           India
## RU          Russia
## MOR        Morocco
## EG           Egypt
##            country  drives_right
## US   United States          True
## AUS      Australia         False
## JAP          Japan         False
## IN           India         False
## RU          Russia          True
## MOR        Morocco          True
## EG           Egypt          True
##      cars_per_cap        country  drives_right
## US            809  United States          True
## AUS           731      Australia         False
## JAP           588          Japan         False
##      cars_per_cap  country  drives_right
## IN             18    India         False
## RU            200   Russia          True
## MOR            70  Morocco          True
## cars_per_cap      588
## country         Japan
## drives_right    False
## Name: JAP, dtype: object
##      cars_per_cap    country  drives_right
## AUS           731  Australia         False
## EG             45      Egypt          True
##      drives_right
## MOR          True
##      country  drives_right
## RU    Russia          True
## MOR  Morocco          True
## US      True
## AUS    False
## JAP    False
## IN     False
## RU      True
## MOR     True
## EG      True
## Name: drives_right, dtype: bool
##      drives_right
## US           True
## AUS         False
## JAP         False
## IN          False
## RU           True
## MOR          True
## EG           True
##      cars_per_cap  drives_right
## US            809          True
## AUS           731         False
## JAP           588         False
## IN             18         False
## RU            200          True
## MOR            70          True
## EG             45          True

Chapter 3 - Logic, Control Flow, and Filtering

Comparison Operators - how two values relate (tests for equality, greater, lesser, etc.):

  • Less than (<), greater than (>), equals (==), less than or equal (<=), greater than or equal (>=), and not equals (!=) are as per R
  • Ordering comparisons (<, >, <=, >=) need objects of compatible types; int, float, and bool can be mixed, but comparing a string with a number raises a TypeError in Python 3
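
A short sketch of how comparisons behave across types in Python 3 (the variable names are illustrative): numeric types mix freely, equality between a string and a number is simply False, and ordering a string against a number raises an error.

```python
# int, float and bool compare with each other without any conversion
mixed_numeric = 3 == 3.0
bool_as_int = True == 1

# Equality between a string and a number is allowed, but is simply False
str_vs_num_equal = "3" == 3

# Ordering a string against a number raises TypeError in Python 3
try:
    "3" < 3
    ordering_failed = False
except TypeError:
    ordering_failed = True

print(mixed_numeric, bool_as_int, str_vs_num_equal, ordering_failed)
```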

Boolean operators - most commonly used are and, or, and not:

  • In Python, the word “and” is used rather than & or &&
  • In Python, the word “or” is used rather than | or ||
  • In Python, the word “not” is used rather than !
  • If comparisons will be run on an array, then use np.logical_and(), np.logical_or(), and np.logical_not()
    • np.logical_and(bmi > 27, bmi < 30)
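
The reason for the numpy functions can be sketched as follows, assuming a small made-up `bmi` array: the plain `and` keyword cannot reduce a whole array to a single True or False, whereas np.logical_and() works element by element. The `&` operator shown at the end is an equivalent spelling for boolean arrays.

```python
import numpy as np

# Hypothetical BMI values
bmi = np.array([21.8, 28.5, 31.0, 27.4])

# Plain `and` cannot decide the truth value of a whole array
try:
    bmi > 27 and bmi < 30
    plain_and_failed = False
except ValueError:
    plain_and_failed = True

# Element-wise version
in_range = np.logical_and(bmi > 27, bmi < 30)

# The & operator is equivalent for boolean arrays; the parentheses are required
in_range_op = (bmi > 27) & (bmi < 30)

print(in_range)
print(bmi[in_range])
```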

If, elif, else:

  • General syntax is “if condition : action” followed optionally by one or more “elif condition : action” and finally by “else : action” (else takes no condition)
    • If written on multiple lines, the action should be indented by 4 spaces and may include block instructions
    • Any code without the indentation will be known to no longer be part of the if block
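
A minimal sketch of the construct, wrapped in a hypothetical helper function `describe_area` (the function name and thresholds are illustrative) so that the branches can be exercised with several inputs:

```python
def describe_area(area):
    """Return a size label for a room area (thresholds are illustrative)."""
    if area > 15:
        label = "big"
    elif area > 10:
        label = "medium"
    else:                  # else takes no condition: it catches everything left
        label = "small"
    # This line is not indented under the if block, so it always runs
    return label

print(describe_area(18.0), describe_area(14.0), describe_area(9.5))
```

Only the first branch whose condition is True runs, so an area of 18.0 is labelled "big" even though it also exceeds 10.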

Filtering Pandas DataFrame - generally a three-step process of 1) select the key column as a pandas Series, 2) run the test to get a boolean Series, and 3) use that boolean Series to grab the relevant rows:

  • If you pass myFrame[myBool] where myBool is the same size (number of rows) as myFrame, then it will automatically pull back the rows where myBool == True
  • Because pandas is built on the numpy infrastructure, np.logical_and() and the related functions will also work on pandas Series
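
The three steps can be sketched on a small made-up frame (the data and the 100/700 thresholds are illustrative, not from the course). The last line shows the `&` operator as an alternative to np.logical_and().

```python
import numpy as np
import pandas as pd

# Hypothetical data
cars = pd.DataFrame(
    {"cars_per_cap": [809, 588, 200, 45],
     "drives_right": [True, False, True, True]},
    index=["US", "JAP", "RU", "EG"],
)

# Step 1: select the column as a Series
cpc = cars["cars_per_cap"]
# Step 2: run the test to get a boolean Series
between = np.logical_and(cpc > 100, cpc < 700)
# Step 3: use the boolean Series to pick the matching rows
medium = cars[between]

# Same filter written with the & operator (parentheses are required)
medium_op = cars[(cpc > 100) & (cpc < 700)]

print(medium)
```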

Example code includes:


# Comparison of booleans
print(True == False)

# Comparison of integers
print((-5 * 15) != 75)

# Comparison of strings
print("pyscript" == "PyScript")

# Compare a boolean with an integer
print(True == 1)


# Comparison of integers
x = -3 * 6
print(x >= -10)

# Comparison of strings
y = "test"
print("test" <= y)

# Comparison of booleans
print(True > False)


# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# my_house greater than or equal to 18
print(my_house >= 18)

# my_house less than your_house
print(my_house < your_house)


# Define variables
my_kitchen = 18.0
your_kitchen = 14.0

# my_kitchen bigger than 10 and smaller than 18?
print(my_kitchen > 10 and my_kitchen < 18)

# my_kitchen smaller than 14 or bigger than 17?
print(my_kitchen < 14 or my_kitchen > 17)

# Double my_kitchen smaller than triple your_kitchen?
print(2 * my_kitchen < 3 * your_kitchen)


# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house > 18.5, my_house < 10))

# Both my_house and your_house smaller than 11
print(np.logical_and(my_house <11, your_house < 11))


# Define variables
room = "kit"
area = 14.0

# if statement for room
if room == "kit" :
    print("looking around in the kitchen.")

# if statement for area
if area > 15 :
    print("big place!")


# Define variables
room = "kit"
area = 14.0

# if-else construct for room
if room == "kit" :
    print("looking around in the kitchen.")
else :
    print("looking around elsewhere.")

# if-else construct for area
if area > 15 :
    print("big place!")
else :
    print("pretty small.")


# Define variables
room = "bed"
area = 14.0

# if-elif-else construct for room
if room == "kit" :
    print("looking around in the kitchen.")
elif room == "bed":
    print("looking around in the bedroom.")
else :
    print("looking around elsewhere.")

# if-elif-else construct for area
if area > 15 :
    print("big place!")
elif area > 10 :
    print("medium size, nice!")
else :
    print("pretty small.")


# AS PER ABOVE, DO NOT HAVE THIS DATASET
# That has since been worked around . . . 
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Extract drives_right column as Series: dr
dr = cars["drives_right"]

# Use dr to subset cars: sel
sel = cars[dr]

# Print sel
print(sel)


# Convert code to a one-liner
sel = cars[cars['drives_right']]

# Print sel
print(sel)


# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars["cars_per_cap"]
many_cars = cpc > 500
car_maniac = cars[many_cars]

# Print car_maniac
print(car_maniac)


# Create medium: observations with cars_per_cap between 100 and 500
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 100, cpc < 500)
medium = cars[between]

# Print medium
print(medium)
## False
## True
## False
## True
## False
## True
## True
## [ True  True False False]
## [False  True  True False]
## False
## True
## True
## [False  True False  True]
## [False False False  True]
## looking around in the kitchen.
## looking around in the kitchen.
## pretty small.
## looking around in the bedroom.
## medium size, nice!
##      cars_per_cap        country  drives_right
## US            809  United States          True
## RU            200         Russia          True
## MOR            70        Morocco          True
## EG             45          Egypt          True
##      cars_per_cap        country  drives_right
## US            809  United States          True
## RU            200         Russia          True
## MOR            70        Morocco          True
## EG             45          Egypt          True
##      cars_per_cap        country  drives_right
## US            809  United States          True
## AUS           731      Australia         False
## JAP           588          Japan         False
##     cars_per_cap country  drives_right
## RU           200  Russia          True

Chapter 4 - Loops

The while loop - alternative to the if/elif/else process:

  • The while loop will continue to execute as long as the condition is met
  • These loops are typically rare (and can easily cause an infinite loop), but can be powerful in certain circumstances
  • The syntax is while condition : expression # If expression is placed on the next line(s), then it should be indented by 4 spaces
  • CTRL-C will typically kill a Python infinite loop
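
One common way to guard against an accidental infinite loop is to add an iteration cap to the condition. The sketch below is an illustration only; `max_iterations` is a hypothetical safety limit and is not part of the course code.

```python
offset = 5.0
max_iterations = 100     # illustrative safety cap
iterations = 0

# Halve offset until it is at most 1; the cap guarantees the loop terminates
while offset > 1 and iterations < max_iterations:
    offset = offset / 2
    iterations += 1

print(offset, iterations)
```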

The for loop - alternative to the while loop:

  • The basic syntax is for var in seq : expression # as per previous, if expression is on the next line(s), it should be indented by 4 spaces
    • The seq can be a list or a dictionary or the like, which will iterate by item in the list or dictionary or the like (for a dictionary, the items are its keys)
  • Using a list, the enumerate() command will pull out a tuple which can be used in the iterations
    • for a, b in enumerate(myList) : expression # a will be the index and b will be the value
  • If iterating over a string, the for loop will extract character by character
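
A short sketch of these for-loop variants, using illustrative data. The optional `start` argument of enumerate() is a standard parameter that shifts the first index, which avoids writing `index + 1` by hand.

```python
areas = [11.25, 18.0, 20.0]

# enumerate yields (index, value) tuples; start=1 shifts the first index
labels = []
for index, area in enumerate(areas, start=1):
    labels.append("room " + str(index) + ": " + str(area))

# Iterating over a string yields one character at a time
letters = [ch.upper() for ch in "py"]

print(labels)
print(letters)
```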

Looping data structures - Part I - extension to dictionaries, numpy arrays, and the like:

  • Looping through the key-value pairs of a dictionary requires calling the .items() method on the dictionary (a plain for loop over the dictionary yields only the keys)
    • for key, value in myDict.items() : expression # will extract key, value as tuples
    • The keys are not sorted; since Python 3.7 they come out in insertion order, while earlier versions could return them in any order (hash-table dependent)
  • Looping through a 1-D numpy array will work the same as looping through a list; standard for-loop syntax
  • Looping through a 2-D numpy array will extract the 1-D numpy arrays underlying the 2-D numpy array (which may or may not be the desired output)
    • Alternately, using np.nditer(myNumpy2D) will extract the items one at a time
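
The different iteration behaviours can be compared side by side. This is a minimal sketch on made-up data (`europe` and `grid` are illustrative), collecting what each loop yields into a list so the results are easy to inspect.

```python
import numpy as np

europe = {"spain": "madrid", "france": "paris"}

# Iterating a dictionary directly yields only the keys
keys_only = [k for k in europe]
# .items() yields (key, value) tuples
pairs = [(k, v) for k, v in europe.items()]

grid = np.array([[1, 2], [3, 4]])
# A plain for loop over a 2-D array yields its rows (1-D arrays)
rows = [row.tolist() for row in grid]
# np.nditer visits every element one at a time
elements = [int(x) for x in np.nditer(grid)]

print(keys_only, pairs, rows, elements)
```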

Looping data structures - Part II - extension to pandas DataFrame:

  • The basic expression for x in myPanda : expression # the x will just iterate across the column names
  • To extract the rows, use for lab, row in myPanda.iterrows() : expression # the lab will be the row name and the row will be the row data, iterated over all the rows
  • Rather than using a loop, the apply function can be used to create new columns in the panda
    • myPanda["myNewCol"] = myPanda["myOldCol"].apply(len) # will create new variable myNewCol holding len() of each entry of myOldCol
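
The three approaches can be sketched on a small made-up frame (the data and the column name `name_length` are illustrative):

```python
import pandas as pd

# Hypothetical data
cars = pd.DataFrame(
    {"country": ["Japan", "Egypt"], "cars_per_cap": [588, 45]},
    index=["JAP", "EG"],
)

# A plain for loop over a DataFrame yields the column names
column_names = [col for col in cars]

# iterrows() yields (row label, row as Series) pairs
row_labels = [lab for lab, row in cars.iterrows()]

# apply() computes a new column without an explicit loop
cars["name_length"] = cars["country"].apply(len)

print(column_names, row_labels)
print(cars)
```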

Example code includes:


# Initialize offset
offset = 8

# Code the while loop
while offset != 0 :
    print("correcting...")
    offset = offset - 1
    print(offset)


# Initialize offset
offset = -6

# Code the while loop
while offset != 0 :
    print("correcting...")
    if offset > 0 :
        offset = offset - 1
    else :
        offset = offset + 1
    print(offset)


# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Code the for loop
for x in areas :
    print(x)


# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Change for loop to use enumerate()
for a, b in enumerate(areas) :
    print("room " + str(a) + ": " + str(b))


# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Code the for loop
for index, area in enumerate(areas) :
    print("room " + str(index + 1) + ": " + str(area))


# house list of lists
house = [["hallway", 11.25], 
         ["kitchen", 18.0], 
         ["living room", 20.0], 
         ["bedroom", 10.75], 
         ["bathroom", 9.50]]
         
# Build a for loop from scratch
for rooms in house :
    print("the " + str(rooms[0]) + " is " + str(rooms[1]) + " sqm")


# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn', 
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'australia':'vienna' }
          
# Iterate over europe
for country, capital in europe.items() :
    print("the capital of " + str(country) + " is " + str(capital))


# Import numpy as np
import numpy as np

# DO NOT HAVE EITHER DATASET
# Create np_height
height = np.round(np.random.normal(1.75, 0.20, 50), 2)  
np_height = np.array(height)

# Create np_baseball
# baseball = [180, 215, 210, 210, 188, 176, 209, 200]
# np_baseball = np.array(baseball)

weight = np.round(np.random.normal(60.32, 15, 50), 2)
np_baseball = np.column_stack((height, weight))


# For loop over np_height
for height in np_height :
    print(str(height) + " inches")

# The end= argument over-rides the default to move to a new line
# For loop over np_baseball
for item in np.nditer(np_baseball) :
    print(item, end=" ")


# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Iterate over rows of cars
for lab, dat in cars.iterrows() :
    print(lab)
    print(dat)


# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Adapt for loop
for lab, row in cars.iterrows() :
    print(lab + ": " + str(row['cars_per_cap']))


# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Code for loop that adds COUNTRY column
for lab, row in cars.iterrows() :
    cars.loc[lab, "COUNTRY"] = row['country'].upper()

# Print cars
print(cars)


# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Use .apply(str.upper)
cars["COUNTRY"] = cars["country"].apply(str.upper)
print(cars)
## correcting...
## 7
## correcting...
## 6
## correcting...
## 5
## correcting...
## 4
## correcting...
## 3
## correcting...
## 2
## correcting...
## 1
## correcting...
## 0
## correcting...
## -5
## correcting...
## -4
## correcting...
## -3
## correcting...
## -2
## correcting...
## -1
## correcting...
## 0
## 11.25
## 18.0
## 20.0
## 10.75
## 9.5
## room 0: 11.25
## room 1: 18.0
## room 2: 20.0
## room 3: 10.75
## room 4: 9.5
## room 1: 11.25
## room 2: 18.0
## room 3: 20.0
## room 4: 10.75
## room 5: 9.5
## the hallway is 11.25 sqm
## the kitchen is 18.0 sqm
## the living room is 20.0 sqm
## the bedroom is 10.75 sqm
## the bathroom is 9.5 sqm
## the capital of spain is madrid
## the capital of france is paris
## the capital of germany is bonn
## the capital of norway is oslo
## the capital of italy is rome
## the capital of poland is warsaw
## the capital of australia is vienna
## 1.71 inches
## 1.79 inches
## 1.31 inches
## 1.76 inches
## 1.86 inches
## 1.66 inches
## 1.9 inches
## 1.58 inches
## 1.66 inches
## 1.75 inches
## 1.79 inches
## 1.43 inches
## 1.98 inches
## 1.65 inches
## 1.84 inches
## 1.75 inches
## 1.61 inches
## 1.7 inches
## 1.65 inches
## 1.58 inches
## 1.9 inches
## 1.86 inches
## 1.52 inches
## 1.63 inches
## 1.45 inches
## 1.67 inches
## 1.73 inches
## 1.65 inches
## 1.69 inches
## 1.79 inches
## 1.57 inches
## 1.89 inches
## 2.03 inches
## 1.66 inches
## 1.7 inches
## 1.57 inches
## 1.73 inches
## 2.15 inches
## 1.94 inches
## 1.63 inches
## 1.92 inches
## 1.74 inches
## 1.93 inches
## 1.45 inches
## 1.3 inches
## 1.67 inches
## 1.74 inches
## 1.94 inches
## 1.73 inches
## 2.03 inches
## 1.71 65.93 1.79 68.6 1.31 55.46 1.76 66.15 1.86 76.35 1.66 50.24 1.9 57.56 1.58 63.67 1.66 68.66 1.75 59.61 1.79 33.5 1.43 54.03 1.98 57.63 1.65 69.76 1.84 94.77 1.75 49.72 1.61 41.84 1.7 68.6 1.65 44.78 1.58 83.44 1.9 67.0 1.86 59.9 1.52 65.31 1.63 87.8 1.45 57.19 1.67 100.88 1.73 55.25 1.65 50.76 1.69 76.63 1.79 52.51 1.57 67.38 1.89 56.82 2.03 52.05 1.66 57.64 1.7 74.83 1.57 47.78 1.73 65.63 2.15 55.28 1.94 58.16 1.63 55.27 1.92 43.57 1.74 77.25 1.93 64.13 1.45 23.82 1.3 67.15 1.67 76.92 1.74 41.49 1.94 62.84 1.73 75.77 2.03 59.92 US
## cars_per_cap              809
## country         United States
## drives_right             True
## Name: US, dtype: object
## AUS
## cars_per_cap          731
## country         Australia
## drives_right        False
## Name: AUS, dtype: object
## JAP
## cars_per_cap      588
## country         Japan
## drives_right    False
## Name: JAP, dtype: object
## IN
## cars_per_cap       18
## country         India
## drives_right    False
## Name: IN, dtype: object
## RU
## cars_per_cap       200
## country         Russia
## drives_right      True
## Name: RU, dtype: object
## MOR
## cars_per_cap         70
## country         Morocco
## drives_right       True
## Name: MOR, dtype: object
## EG
## cars_per_cap       45
## country         Egypt
## drives_right     True
## Name: EG, dtype: object
## US: 809
## AUS: 731
## JAP: 588
## IN: 18
## RU: 200
## MOR: 70
## EG: 45
##      cars_per_cap        country  drives_right        COUNTRY
## US            809  United States          True  UNITED STATES
## AUS           731      Australia         False      AUSTRALIA
## JAP           588          Japan         False          JAPAN
## IN             18          India         False          INDIA
## RU            200         Russia          True         RUSSIA
## MOR            70        Morocco          True        MOROCCO
## EG             45          Egypt          True          EGYPT
##      cars_per_cap        country  drives_right        COUNTRY
## US            809  United States          True  UNITED STATES
## AUS           731      Australia         False      AUSTRALIA
## JAP           588          Japan         False          JAPAN
## IN             18          India         False          INDIA
## RU            200         Russia          True         RUSSIA
## MOR            70        Morocco          True        MOROCCO
## EG             45          Egypt          True          EGYPT

Chapter 5 - Case Study: Hacker Statistics

Random numbers - random walk using a 6-sided dice where 1/2 means -1, 3/4/5 means +1, and 6 means roll again and go up the number of the next roll:

  • Further, set a floor of step 0 and also add a 0.1% chance of falling down the stairs (presumably reverting to floor 0) at any given move
  • Assume that this is a 100-move game, and assess the odds of ending at floor 60+
  • Hacker statistics is simulating the game to assess the probabilities, as opposed to solving the game analytically
  • The np.random.rand() function will return a random number between 0 and 1
    • Calling np.random.seed(mySeed) will set the seed for the upcoming trials, useful for reproducibility
    • Calling np.random.randint(a, b) will generate random integers from a to b-1 (both inclusive) assuming equal probabilities
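
A small sketch of seeding and of the randint() bounds, assuming the legacy `np.random` interface used throughout these notes. Resetting the seed reproduces the same sequence, and the upper bound passed to randint() is never returned.

```python
import numpy as np

# Same seed, same sequence of simulated dice rolls
np.random.seed(123)
first = np.random.randint(1, 7, size=1000)

np.random.seed(123)
second = np.random.randint(1, 7, size=1000)

same_sequence = bool((first == second).all())

# The rolls stay within 1..6: the upper bound 7 is exclusive
lowest, highest = int(first.min()), int(first.max())

# rand() returns a float in the half-open interval [0, 1)
u = np.random.rand()

print(same_sequence, lowest, highest)
```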

Random walk - well-known pattern in science:

  • Initializing an empty list can be done with the square brackets; myEmptyList = []
    • Appending items to the list can be done with myEmptyList.append(myEntry)
  • For the random walk, start at 0 making myList = [0]
    • Then, can run for x in range(runs) : myList.append(myList[x] + myRandom)
    • Note that range(x) will generate the integers from 0 to x-1 (in Python 3 this is a lazy range object rather than a list)
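
The append pattern can be sketched with a deliberately simplified walk. This is an illustration only: each move is a coin-flip of -1 or +1 rather than the dice rules of the game, and `runs` and `move` are hypothetical names.

```python
import numpy as np

np.random.seed(123)
runs = 10                       # illustrative number of steps

walk = [0]                      # start at step 0
for x in range(runs):           # x takes the values 0, 1, ..., runs - 1
    move = int(np.random.choice([-1, 1]))   # simplified move, not the dice rules
    walk.append(walk[x] + move)             # walk[x] is the latest position so far

print(list(range(3)))
print(len(walk))
```

Because exactly one element is appended per iteration, `walk[x]` and `walk[-1]` refer to the same entry inside the loop.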

Distribution of random walks - expanding on the 100-trial random walk:

  • Simulating many times allows for building a distribution and then making calculations based on that distribution
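
As a sketch of the kind of calculation the distribution allows, the example below estimates the chance of ending at step 60 or higher by taking the share of simulated games that do so. The helper `simulate_end` and the choice of 500 games are illustrative assumptions; the rules inside it follow the game as described in these notes.

```python
import numpy as np

def simulate_end(n_moves=100):
    """Play one game under the rules described in these notes; return the final step."""
    step = 0
    for _ in range(n_moves):
        dice = np.random.randint(1, 7)
        if dice <= 2:
            step = max(0, step - 1)          # cannot go below step 0
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1, 7)
        if np.random.rand() <= 0.001:        # 0.1% chance of falling back to 0
            step = 0
    return step

np.random.seed(123)
n_games = 500      # illustrative; more games give a steadier estimate
ends = np.array([simulate_end() for _ in range(n_games)])

# Share of games that end at step 60 or higher
prob_60_plus = np.mean(ends >= 60)
print(prob_60_plus)
```

The estimate will vary with the seed and the number of games, which is why no expected value is quoted here.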

Example code includes:


# Import numpy as np
import numpy as np

# Set the seed
np.random.seed(123)

# Generate and print random float
print(np.random.rand())


# Import numpy and set seed
import numpy as np
np.random.seed(123)

# Use randint() to simulate a dice
print(np.random.randint(1, 7))

# Use randint() again
print(np.random.randint(1, 7))


# Import numpy and set seed
import numpy as np
np.random.seed(123)

# Starting step
step = 50

# Roll the dice
dice = np.random.randint(1, 7)

# Finish the control construct
if dice <= 2 :
    step = step - 1
elif dice < 6 :
    step = step + 1
else :
    step = step + np.random.randint(1,7)

# Print out dice and step
print(dice)
print(step)


# Import numpy and set seed
import numpy as np
np.random.seed(123)

# Initialize random_walk
random_walk = [0]

# Simulate 100 steps of the walk
for x in range(100) :
    # Set step: last element in random_walk
    step = random_walk[-1]

    # Roll the dice
    dice = np.random.randint(1,7)

    # Determine next step
    if dice <= 2:
        step = step - 1
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)

    # Append step to random_walk
    random_walk.append(step)

# Print random_walk
print(random_walk)


# Import numpy and set seed
import numpy as np
np.random.seed(123)

# Initialize random_walk
random_walk = [0]

for x in range(100) :
    step = random_walk[-1]
    dice = np.random.randint(1,7)

    if dice <= 2:
        # Replace below: use max to make sure step can't go below 0
        step = max(0, step - 1)
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)

    random_walk.append(step)

print(random_walk)


# Initialization
import numpy as np
np.random.seed(123)
random_walk = [0]

for x in range(100) :
    step = random_walk[-1]
    dice = np.random.randint(1,7)

    if dice <= 2:
        step = max(0, step - 1)
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)

    random_walk.append(step)

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Plot random_walk
plt.plot(random_walk)

# Show the plot
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy015.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting


# Initialization
import numpy as np
np.random.seed(123)

# Initialize all_walks
all_walks = []

# Simulate random walk 10 times
for i in range(10) :

    # Code from before
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)

        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)

    # Append random_walk to all_walks
    all_walks.append(random_walk)

# Print all_walks
print(all_walks)


import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
all_walks = []
for i in range(10) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)
    all_walks.append(random_walk)

# Convert all_walks to Numpy array: np_aw
np_aw = np.array(all_walks)

# Plot np_aw and show
plt.plot(np_aw)
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy016.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting


# Transpose np_aw: np_aw_t
np_aw_t = np.transpose(np_aw)

# Plot np_aw_t and show
plt.plot(np_aw_t)
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy017.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting


import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
all_walks = []

# Simulate random walk 250 times
for i in range(250) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)

        # Implement clumsiness
        if np.random.rand() <= 0.001 :
            step = 0

        random_walk.append(step)
    all_walks.append(random_walk)

# Create and plot np_aw_t
np_aw_t = np.transpose(np.array(all_walks))
plt.plot(np_aw_t)
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy018.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting


import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
all_walks = []

# Simulate random walk 500 times
for i in range(500) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        if np.random.rand() <= 0.001 :
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)

# Create np_aw_t
np_aw_t = np.transpose(np.array(all_walks))

# Select last row from np_aw_t: ends
ends = np_aw_t[-1]

# Plot histogram of ends, display plot
plt.hist(ends)
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy019.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting
## 0.6964691855978616
## 6
## 3
## 6
## 53
## [0, 3, 4, 5, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1, 0, -1, 0, 5, 4, 3, 4, 3, 4, 5, 6, 7, 8, 7, 8, 7, 8, 9, 10, 11, 10, 14, 15, 14, 15, 14, 15, 16, 17, 18, 19, 20, 21, 24, 25, 26, 27, 32, 33, 37, 38, 37, 38, 39, 38, 39, 40, 42, 43, 44, 43, 42, 43, 44, 43, 42, 43, 44, 46, 45, 44, 45, 44, 45, 46, 47, 49, 48, 49, 50, 51, 52, 53, 52, 51, 52, 51, 52, 53, 52, 55, 56, 57, 58, 57, 58, 59]
## [0, 3, 4, 5, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1, 0, 0, 1, 6, 5, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 8, 9, 10, 11, 12, 11, 15, 16, 15, 16, 15, 16, 17, 18, 19, 20, 21, 22, 25, 26, 27, 28, 33, 34, 38, 39, 38, 39, 40, 39, 40, 41, 43, 44, 45, 44, 43, 44, 45, 44, 43, 44, 45, 47, 46, 45, 46, 45, 46, 47, 48, 50, 49, 50, 51, 52, 53, 54, 53, 52, 53, 52, 53, 54, 53, 56, 57, 58, 59, 58, 59, 60]
## [[0, 3, 4, 5, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1, 0, 0, 1, 6, 5, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 8, 9, 10, 11, 12, 11, 15, 16, 15, 16, 15, 16, 17, 18, 19, 20, 21, 22, 25, 26, 27, 28, 33, 34, 38, 39, 38, 39, 40, 39, 40, 41, 43, 44, 45, 44, 43, 44, 45, 44, 43, 44, 45, 47, 46, 45, 46, 45, 46, 47, 48, 50, 49, 50, 51, 52, 53, 54, 53, 52, 53, 52, 53, 54, 53, 56, 57, 58, 59, 58, 59, 60], [0, 4, 3, 2, 4, 3, 4, 6, 7, 8, 13, 12, 13, 14, 15, 16, 17, 16, 21, 22, 23, 24, 23, 22, 21, 20, 19, 20, 21, 22, 28, 27, 26, 25, 26, 27, 28, 27, 28, 29, 28, 33, 34, 33, 32, 31, 30, 31, 30, 29, 31, 32, 35, 36, 38, 39, 40, 41, 40, 39, 40, 41, 42, 43, 42, 43, 44, 45, 48, 49, 50, 49, 50, 49, 50, 51, 52, 56, 55, 54, 55, 56, 57, 56, 57, 56, 57, 59, 64, 63, 64, 65, 66, 67, 68, 69, 68, 69, 70, 71, 73], [0, 2, 1, 2, 3, 6, 5, 6, 5, 6, 7, 8, 7, 8, 7, 8, 9, 11, 10, 9, 10, 11, 10, 12, 13, 14, 15, 16, 17, 18, 17, 18, 19, 24, 25, 24, 23, 22, 21, 22, 23, 24, 29, 30, 29, 30, 31, 32, 33, 34, 35, 34, 33, 34, 33, 39, 38, 39, 38, 39, 38, 39, 43, 47, 49, 51, 50, 51, 53, 52, 58, 59, 61, 62, 61, 62, 63, 64, 63, 64, 65, 66, 68, 67, 66, 67, 73, 78, 77, 76, 80, 81, 82, 83, 85, 84, 85, 84, 85, 84, 83], [0, 6, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12, 13, 12, 11, 12, 11, 12, 11, 12, 13, 17, 18, 17, 23, 22, 21, 22, 21, 20, 21, 20, 24, 23, 24, 23, 24, 23, 24, 26, 25, 24, 23, 24, 23, 28, 29, 30, 29, 28, 29, 28, 29, 28, 33, 34, 33, 32, 31, 30, 31, 32, 36, 42, 43, 44, 45, 46, 45, 46, 48, 49, 50, 51, 50, 49, 50, 49, 50, 51, 52, 51, 52, 53, 54, 53, 52, 53, 54, 59, 60, 61, 66, 65, 66, 65, 66, 67, 68, 69, 68], [0, 6, 5, 6, 5, 4, 5, 9, 10, 11, 12, 13, 12, 11, 10, 9, 8, 9, 10, 11, 12, 13, 14, 13, 14, 15, 14, 15, 16, 19, 18, 19, 18, 19, 22, 23, 24, 25, 24, 23, 26, 27, 28, 29, 28, 27, 28, 31, 32, 37, 38, 37, 38, 37, 38, 37, 43, 42, 41, 42, 44, 43, 42, 41, 42, 43, 44, 45, 49, 54, 55, 56, 57, 60, 61, 62, 63, 64, 65, 66, 65, 64, 65, 66, 65, 71, 70, 71, 72, 71, 70, 71, 70, 69, 75, 74, 73, 74, 75, 74, 73], [0, 0, 0, 1, 7, 8, 11, 12, 18, 19, 20, 26, 
25, 31, 30, 31, 32, 33, 32, 38, 39, 38, 39, 38, 39, 38, 39, 38, 39, 43, 44, 46, 45, 46, 45, 44, 45, 44, 45, 44, 48, 52, 51, 50, 49, 50, 51, 55, 56, 57, 61, 60, 59, 58, 59, 60, 62, 61, 60, 61, 62, 64, 67, 72, 73, 72, 73, 74, 75, 76, 77, 76, 77, 78, 84, 83, 88, 87, 91, 90, 94, 93, 96, 97, 96, 97, 103, 102, 101, 100, 104, 103, 102, 103, 104, 103, 104, 105, 106, 107, 106], [0, 0, 0, 1, 0, 0, 4, 5, 7, 11, 17, 16, 15, 16, 17, 18, 17, 18, 17, 18, 19, 18, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 33, 32, 35, 36, 35, 34, 35, 36, 37, 36, 35, 34, 33, 34, 35, 36, 37, 38, 39, 40, 39, 40, 41, 43, 42, 43, 44, 47, 49, 50, 49, 48, 47, 46, 45, 46, 45, 46, 48, 49, 50, 49, 50, 49, 48, 49, 48, 47, 46, 47, 46, 45, 46, 47, 48, 50, 51, 52, 51, 50, 51, 57, 56, 57, 58, 63, 62, 63], [0, 0, 1, 2, 1, 2, 3, 9, 10, 11, 12, 11, 13, 14, 15, 16, 15, 16, 17, 18, 19, 18, 19, 18, 19, 20, 19, 20, 24, 25, 28, 29, 33, 34, 33, 34, 35, 34, 33, 38, 39, 40, 39, 38, 39, 40, 41, 40, 44, 43, 44, 45, 46, 47, 48, 49, 50, 49, 48, 47, 48, 49, 53, 54, 53, 54, 55, 54, 60, 61, 62, 63, 62, 63, 64, 67, 66, 67, 66, 65, 64, 65, 66, 68, 69, 70, 74, 75, 74, 73, 74, 75, 74, 73, 74, 75, 76, 75, 74, 75, 76], [0, 1, 0, 1, 2, 1, 0, 0, 1, 2, 3, 4, 5, 10, 14, 13, 14, 13, 12, 11, 12, 11, 12, 13, 12, 16, 17, 16, 17, 16, 15, 16, 15, 19, 20, 21, 22, 23, 24, 23, 24, 25, 26, 27, 28, 27, 32, 33, 34, 33, 34, 33, 34, 35, 34, 35, 40, 41, 42, 41, 42, 43, 44, 43, 44, 43, 44, 45, 44, 43, 42, 43, 44, 43, 42, 41, 42, 46, 47, 48, 49, 50, 51, 50, 51, 52, 51, 52, 57, 58, 57, 56, 57, 56, 55, 54, 58, 59, 60, 61, 60], [0, 1, 2, 3, 4, 5, 4, 3, 6, 5, 4, 3, 2, 3, 9, 10, 9, 10, 11, 10, 9, 10, 11, 12, 11, 15, 16, 15, 17, 18, 17, 18, 19, 20, 21, 22, 23, 22, 21, 22, 23, 22, 23, 24, 23, 22, 21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 33, 34, 35, 36, 37, 38, 37, 36, 42, 43, 44, 43, 42, 41, 45, 46, 50, 49, 55, 56, 57, 61, 62, 61, 60, 61, 62, 63, 64, 63, 69, 70, 69, 73, 74, 73, 74, 73, 79, 85, 86, 85, 86, 87]]

Single random walk:

10 full walks:

10 full walks transposed:

250 random walks with “clumsiness”:

500 random walks with “clumsiness”:


Python Data Science Toolbox (Part I)

Chapter 1 - Writing your own functions

User-defined functions - with/without parameters, and with/without returning values:

  • The general syntax is def myFunction(myParams) : commands # commands can be on additional lines provided that they are indented consistently (4 spaces by convention)
  • Parameters are defined in the function header, while arguments are the values passed in a call; so def myFunction(myParams) declares a parameter while myFunction(myArg) supplies an argument
  • If the statement return myReturn is included in the function, then the function stops at that point and myReturn is passed back to the caller
  • Docstrings serve as documentation for a function, and are placed on the first line after the function header, surrounded by triple quotes ("""myComment""")

Multiple parameters and return values:

  • Functions can accept multiple arguments, such as def myFunction(myParam1, myParam2)
  • Can also return multiple values using tuples (like a list, but immutable and defined using parentheses rather than square brackets)
  • Tuples can be unpacked using just variables separated by commas, so for example:
    • myTuple = (2, 4, 6)
    • a, b, c = myTuple # a will be 2, b will be 4, c will be 6
    • myTuple[1] will be 4, so the tuples can be accessed by way of an index
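
The tuple behavior described above can be sketched in a few lines. The function name min_max is hypothetical (it is not from the course); the point is only that a function can return several values as one tuple, which the caller can then unpack or index:

```python
def min_max(values):
    """Return the smallest and largest items of values as a tuple."""
    return (min(values), max(values))


# Unpack the returned tuple into two names
lowest, highest = min_max([3, 9, 4])
print(lowest, highest)  # 3 9

# The same result can be kept as a tuple and accessed by index
pair = min_max([3, 9, 4])
print(pair[1])  # 9

# Tuples are immutable, so item assignment raises a TypeError
try:
    pair[0] = 100
except TypeError as err:
    print("cannot modify a tuple:", err)
```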

Bringing it all together - practical examples using Twitter data:

  • Function header, function body including a docstring, and (optionally) function returns

Example code includes:


# Define the function shout
def shout():
    """Print a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = "congratulations" + "!!!"
    
    # Print shout_word
    print(shout_word)

# Call shout
shout()


# Define shout with the parameter, word
def shout(word):
    """Print a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = word + '!!!'

    # Print shout_word
    print(shout_word)

# Call shout with the string 'congratulations'
shout("congratulations")


# Define shout with the parameter, word
def shout(word):
    """Return a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = word + "!!!"

    # Replace print with return
    return(shout_word)

# Pass 'congratulations' to shout: yell
yell = shout("congratulations")

# Print yell
print(yell)


# Define shout with parameters word1 and word2
def shout(word1, word2):
    """Concatenate strings with three exclamation marks"""
    # Concatenate word1 with '!!!': shout1
    shout1 = word1 + "!!!"
    
    # Concatenate word2 with '!!!': shout2
    shout2 = word2 + "!!!"
    
    # Concatenate shout1 with shout2: new_shout
    new_shout = shout1 + shout2

    # Return new_shout
    return new_shout

# Pass 'congratulations' and 'you' to shout(): yell
yell = shout("congratulations", "you")

# Print yell
print(yell)


# Set up the nums tuple for later access
nums = (3, 4, 6)

# Unpack nums into num1, num2, and num3
num1, num2, num3 = nums

# Construct even_nums
even_nums = (2, num2, num3)


# Define shout_all with parameters word1 and word2
def shout_all(word1, word2):
    
    # Concatenate word1 with '!!!': shout1
    shout1 = word1 + "!!!"
    
    # Concatenate word2 with '!!!': shout2
    shout2 = word2 + "!!!"
    
    # Construct a tuple with shout1 and shout2: shout_words
    shout_words = (shout1, shout2)

    # Return shout_words
    return shout_words

# Pass 'congratulations' and 'you' to shout_all(): yell1, yell2
yell1, yell2 = shout_all("congratulations", "you")

# Print yell1 and yell2
print(yell1)
print(yell2)


# Import pandas
import pandas as pd

# DO NOT HAVE THIS CSV; CAN JUST MAKE A COLUMN WITH A SINGLE WORD FOR THE EXAMPLE
# Import Twitter data as DataFrame: df
df = pd.read_csv("tweets.csv")

# Initialize an empty dictionary: langs_count
langs_count = {}

# Extract column from DataFrame: col
col = df['lang']

# Iterate over lang column in DataFrame
for entry in col:

    # If the language is in langs_count, add 1
    if entry in langs_count.keys():
        langs_count[entry] = langs_count[entry] + 1
    # Else add the language to langs_count, set the value to 1
    else:
        langs_count[entry] = 1

# Print the populated dictionary
print(langs_count)


# Define count_entries()
def count_entries(df, col_name):
    """Return a dictionary with counts of 
    occurrences as value for each key."""

    # Initialize an empty dictionary: langs_count
    langs_count = {}
    
    # Extract column from DataFrame: col
    col = df[col_name]
    
    # Iterate over lang column in DataFrame
    for entry in col:

        # If the language is in langs_count, add 1
        if entry in langs_count.keys():
            langs_count[entry] = langs_count[entry] + 1
        # Else add the language to langs_count, set the value to 1
        else:
            langs_count[entry] = 1

    # Return the langs_count dictionary
    return(langs_count)

# NEED TO CREATE tweets_df such that it contains a column 'lang'
# Call count_entries(): result
tweets_df = df
result = count_entries(tweets_df, "lang")

# Print the result
print(result)
## congratulations!!!
## congratulations!!!
## congratulations!!!
## congratulations!!!you!!!
## congratulations!!!
## you!!!
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}

Chapter 2 - Default arguments and variable-length arguments

Scope (where are objects or names accessible) and user-defined functions:

  • Global scope - defined in the main body of the script
  • Local scope - defined inside a function (once the function ends, the names and objects from the function disappear)
  • Built-in scope - names in the pre-defined built-ins module (e.g., print)
    • To list the built-ins, type “import builtins” followed by “dir(builtins)” - the module has to be imported before its names can be inspected
  • Search path for a name/object is local scope, then global scope, then built-in scope
  • Can use the key word “global” within a function to access the global scope, meaning that variable will be edited in the global scope, not in the local scope
    • def square(value) : global new_val ; new_val = new_val ** 2; return(new_val)
    • new_val = 10 ; square(3) # returns 100, but now the global variable new_val is also 100
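
The global example from the bullets can be written out as runnable code. This is a sketch only: square_global and shadow are hypothetical names, and the unused value parameter from the notes is dropped for clarity.

```python
new_val = 10


def shadow():
    """Assign to a local new_val; the global one is untouched."""
    new_val = 1  # local name that hides the global inside this function
    return new_val


def square_global():
    """Square the global new_val in place and return it."""
    global new_val  # assignments below now rebind the global name
    new_val = new_val ** 2
    return new_val


shadow_result = shadow()
print(shadow_result, new_val)  # 1 10

result = square_global()
print(result, new_val)  # 100 100
```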

Nested functions - one function defined inside another function:

  • With nested functions, the search is first local, then to the enclosing function, then to the global scope, then to the builtins
  • The inner function can be helpful if the outer function will need to repeat certain actions to achieve its objectives
  • Can return an inner function as the output of a function (example being raising to a user-specified power)
  • There is a computer-science term “closure” that defines exactly how the scopes work during this process
    • Per DataCamp, “One other pretty cool reason for nesting functions is the idea of a closure. This means that the nested or inner function remembers the state of its enclosing scope when called.”
    • Continuing the DataCamp quote “Thus, anything defined locally in the enclosing scope is available to the inner function even when the outer function has finished execution.”
  • The keyword “nonlocal” is available for changing names/values in the enclosing scope (not the global scope; that is keyword “global”)
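
A minimal sketch of the two ideas above: a closure that raises to a user-specified power, and nonlocal used to rebind a name in the enclosing scope. The names raise_to and make_counter are assumptions made for illustration, not taken from the course.

```python
def raise_to(n):
    """Return a function that raises its argument to the power n."""
    def inner(x):
        # n is found in the enclosing scope, even after raise_to() has returned
        return x ** n
    return inner


square = raise_to(2)
cube = raise_to(3)
print(square(4), cube(4))  # 16 64


def make_counter():
    """Return a function that counts how many times it has been called."""
    count = 0

    def increment():
        nonlocal count  # rebind count in make_counter's scope, not a new local
        count += 1
        return count

    return increment


counter = make_counter()
counter()
print(counter())  # 2
```

Without the nonlocal statement, count += 1 would be treated as an assignment to a new local name and raise UnboundLocalError.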

Default and flexible arguments - arguments used when they are not specified, or when a flexible number of arguments can be passed:

  • The default arguments are defined using the equal sign, same as R (can be over-ridden if passed by the user, otherwise the default value will be used)
  • Using a parameter *args (any name with a single star) will collect whatever positional arguments the user passed (zero or more) into a tuple called “args”
    • Python supports augmented assignment with += (as well as -=, *= and /=), as used in the example code
  • Using a parameter **kwargs (any name with a double star) will collect whatever keyword arguments the user passed into a dictionary called “kwargs” of key, value pairs
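
The three argument styles can be combined in one short sketch. The function names total and collect are hypothetical; note that calling with no positional arguments at all is allowed, in which case args is an empty tuple.

```python
def total(*args, start=0):
    """Sum any number of positional arguments, beginning from start."""
    result = start
    for value in args:  # args is a tuple of the positional arguments
        result += value
    return result


def collect(**kwargs):
    """Return the keyword arguments as received: a plain dictionary."""
    return kwargs


print(total())                # 0 (args is an empty tuple)
print(total(1, 2, 3))         # 6
print(total(1, 2, start=10))  # 13 (default overridden by keyword)
print(collect(name="luke", status="missing"))
```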

Bringing it all together - case study on processing a data frame to get word counts, defaulted to column ‘lang’:

  • Objective is to further generalize the process to be able to work on any number (arbitrary, user-specified) of columns in the DataFrame

Example code includes:


# Create a string: team
team = "teen titans"

# Define change_team()
def change_team():
    """Change the value of the global variable team."""

    # Use team in global scope
    global team

    # Change the value of team in global: team
    team = "justice league"

# Print team
print(team)

# Call change_team()
change_team()

# Print team
print(team)


# Define three_shouts
def three_shouts(word1, word2, word3):
    """Returns a tuple of strings
    concatenated with '!!!'."""

    # Define inner
    def inner(word):
        """Returns a string concatenated with '!!!'."""
        return word + '!!!'

    # Return a tuple of strings
    return (inner(word1), inner(word2), inner(word3))

# Call three_shouts() and print
print(three_shouts('a', 'b', 'c'))


# Define echo
def echo(n):
    """Return the inner_echo function."""

    # Define inner_echo
    def inner_echo(word1):
        """Concatenate n copies of word1."""
        echo_word = word1 * n
        return echo_word

    # Return inner_echo
    return inner_echo

# Call echo: twice
twice = echo(2)

# Call echo: thrice
thrice = echo(3)

# Call twice() and thrice() then print
print(twice('hello'), thrice('hello'))


# Define echo_shout()
def echo_shout(word):
    """Change the value of a nonlocal variable"""
    
    # Concatenate word with itself: echo_word
    echo_word = word + word
    
    #Print echo_word
    print(echo_word)
    
    # Define inner function shout()
    def shout():
        """Alter a variable in the enclosing scope"""    
        #Use echo_word in nonlocal scope
        nonlocal echo_word
        
        #Change echo_word to echo_word concatenated with '!!!'
        echo_word = echo_word + "!!!"
    
    # Call function shout()
    shout()
    
    #Print echo_word
    print(echo_word)

#Call function echo_shout() with argument 'hello'    
echo_shout("hello")


# Define shout_echo
def shout_echo(word1, echo=1):
    """Concatenate echo copies of word1 and three
    exclamation marks at the end of the string."""

    # Concatenate echo copies of word1 using *: echo_word
    echo_word = word1 * echo

    # Concatenate '!!!' to echo_word: shout_word
    shout_word = echo_word + '!!!'

    # Return shout_word
    return shout_word

# Call shout_echo() with "Hey": no_echo
no_echo = shout_echo("Hey")

# Call shout_echo() with "Hey" and echo=5: with_echo
with_echo = shout_echo("Hey", 5)

# Print no_echo and with_echo
print(no_echo)
print(with_echo)


# Define shout_echo
def shout_echo(word1, echo=1, intense=False):
    """Concatenate echo copies of word1 and three
    exclamation marks at the end of the string."""

    # Concatenate echo copies of word1 using *: echo_word
    echo_word = word1 * echo

    # Capitalize echo_word if intense is True
    if intense is True:
        # Capitalize and concatenate '!!!': echo_word_new
        echo_word_new = echo_word.upper() + '!!!'
    else:
        # Concatenate '!!!' to echo_word: echo_word_new
        echo_word_new = echo_word + '!!!'

    # Return echo_word_new
    return echo_word_new

# Call shout_echo() with "Hey", echo=5 and intense=True: with_big_echo
with_big_echo = shout_echo("Hey", 5, True)

# Call shout_echo() with "Hey" and intense=True: big_no_echo
big_no_echo = shout_echo("Hey", intense=True)

# Print values
print(with_big_echo)
print(big_no_echo)


# Define gibberish
def gibberish(*args):
    """Concatenate strings in *args together."""

    # Initialize an empty string: hodgepodge
    hodgepodge = ""

    # Concatenate the strings in args
    for word in args:
        hodgepodge += word

    # Return hodgepodge
    return(hodgepodge)

# Call gibberish() with one string: one_word
one_word = gibberish("luke")

# Call gibberish() with five strings: many_words
many_words = gibberish("luke", "leia", "han", "obi", "darth")

# Print one_word and many_words
print(one_word)
print(many_words)


# Define report_status
def report_status(**kwargs):
    """Print out the status of a movie character."""

    print("\nBEGIN: REPORT\n")

    # Iterate over the key-value pairs of kwargs
    for key, value in kwargs.items():
        # Print out the keys and values, separated by a colon ':'
        print(key + ": " + value)

    print("\nEND REPORT")

# First call to report_status()
report_status(name="luke", affiliation="jedi", status="missing")

# Second call to report_status()
report_status(name="anakin", affiliation="sith lord", status="deceased")


# DO NOT HAVE file tweets_df (may need to create some dummy data . . . )
import pandas as pd
tweets_df = pd.read_csv("tweets.csv")


# Define count_entries()
def count_entries(df, col_name="lang"):
    """Return a dictionary with counts of
    occurrences as value for each key."""

    # Initialize an empty dictionary: cols_count
    cols_count = {}

    # Extract column from DataFrame: col
    col = df[col_name]
    
    # Iterate over the column in DataFrame
    for entry in col:

        # If entry is in cols_count, add 1
        if entry in cols_count.keys():
            cols_count[entry] += 1

        # Else add the entry to cols_count, set the value to 1
        else:
            cols_count[entry] = 1

    # Return the cols_count dictionary
    return cols_count

# Call count_entries(): result1
result1 = count_entries(tweets_df)

# Call count_entries(): result2
result2 = count_entries(tweets_df, "source")

# Print result1 and result2
print(result1)
print(result2)


# Define count_entries()
def count_entries(df, *args):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    
    #Initialize an empty dictionary: cols_count
    cols_count = {}
    
    # Iterate over column names in args
    for col_name in args:
    
        # Extract column from DataFrame: col
        col = df[col_name]
    
        # Iterate over the column in DataFrame
        for entry in col:
    
            # If entry is in cols_count, add 1
            if entry in cols_count.keys():
                cols_count[entry] += 1
    
            # Else add the entry to cols_count, set the value to 1
            else:
                cols_count[entry] = 1

    # Return the cols_count dictionary
    return cols_count

# Call count_entries(): result1
result1 = count_entries(tweets_df, "lang")

# Call count_entries(): result2
result2 = count_entries(tweets_df, "lang", "source")

# Print result1 and result2
print(result1)
print(result2)
## teen titans
## justice league
## ('a!!!', 'b!!!', 'c!!!')
## hellohello hellohellohello
## hellohello
## hellohello!!!
## Hey!!!
## HeyHeyHeyHeyHey!!!
## HEYHEYHEYHEYHEY!!!
## HEY!!!
## luke
## lukeleiahanobidarth
## 
## BEGIN: REPORT
## 
## name: luke
## affiliation: jedi
## status: missing
## 
## END REPORT
## 
## BEGIN: REPORT
## 
## name: anakin
## affiliation: sith lord
## status: deceased
## 
## END REPORT
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
## {'C': 60, 'A': 57, 'D': 35, 'B': 48}
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18, 'C': 60, 'A': 57, 'D': 35, 'B': 48}

Chapter 3 - Lambda functions and error handling

Lambda functions - quicker way to write functions on the fly:

  • The general syntax is lambda parameters : expression
    • raise_to_power = lambda x, y : x ** y
    • raise_to_power(2, 3) = 8 # runs as 2 ** 3
  • The lambda function is “quick and dirty”, so it should be limited to key areas where that is appropriate
  • An example is map(func, seq) which will apply the function over all elements of the sequence
    • The lambda function can be valuable here, since it allows for a custom function to be applied quickly across a sequence
    • square_all = map(lambda num: num ** 2, nums)
    • Need to use print(list(square_all)) since print(square_all) will just show that it is a map object at a designated point in memory
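
The lambda and map() bullets above can be run as written. The list nums is an assumed sample input; the last line shows a detail the bullets imply but do not state, namely that a map object is an iterator and is empty once it has been consumed.

```python
# A lambda with two parameters
raise_to_power = lambda x, y: x ** y
print(raise_to_power(2, 3))  # 8

# Apply a lambda to every element of a sequence
nums = [1, 2, 3, 4]
square_all = map(lambda num: num ** 2, nums)
print(square_all)  # prints the map object itself, not the values

squares = list(square_all)
print(squares)  # [1, 4, 9, 16]

# The map object has now been consumed
print(list(square_all))  # []
```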

Introduction to error handling - functions generally return an error if something is wrong, though that can be trapped/over-ridden:

  • Endeavor to provide useful error messages rather than just a trace-back default from Python
  • The typical approach in Python is try-except, where try will try the command and except will run if the try produces an error
    • try : command to try # The commands to try are typically on a new line(s) and indented by 4 spaces
    • except : do otherwise # The except lines up with the try, while the do otherwise are typically on a new line(s) and indented by 4 spaces
  • Can add types of errors to be trapped, for example “type errors only” using except TypeError :
  • To generate an error (for example, if negative inputs are not desired), can use raise ValueError(“quotedMessage”)
    • This will throw a trace-back message, with the quotedMessage appearing at the bottom
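
A minimal sketch of both techniques together, trapping one specific error type and raising another. The function safe_sqrt is a hypothetical example, not course code: a non-numeric input triggers a TypeError that is caught, while a negative input raises a ValueError that is deliberately left to propagate.

```python
def safe_sqrt(x):
    """Return the square root of x, or None if x is not a number."""
    try:
        if x < 0:
            # Not a TypeError, so the except clause below does not catch it
            raise ValueError("x must be non-negative")
        return x ** 0.5
    except TypeError:
        # Reached when x cannot be compared with 0, e.g. a string
        print("x must be an int or float")
        return None


print(safe_sqrt(9))        # 3.0
print(safe_sqrt("hello"))  # prints the message, then None

try:
    safe_sqrt(-4)
except ValueError as err:
    print("ValueError:", err)
```

Naming the error type in the except clause keeps unrelated errors visible instead of silently swallowing them, which a bare except would do.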

Bringing it all together:

  • Case study for error handling on the tweets data frame

Example code includes:


# Define echo_word as a lambda function: echo_word
echo_word = (lambda word1, echo : word1 * echo)

# Call echo_word: result
result = echo_word("hey", 5)

# Print result
print(result)


# Create a list of strings: spells
spells = ["protego", "accio", "expecto patronum", "legilimens"]

# Use map() to apply a lambda function over spells: shout_spells
shout_spells = map(lambda a : a + "!!!", spells)

# Convert shout_spells to a list: shout_spells_list
shout_spells_list = list(shout_spells)

# Convert shout_spells into a list and print it
print(shout_spells_list)


# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Use filter() to apply a lambda function over fellowship: result
result = filter(lambda a : len(a) > 6, fellowship)

# Convert result to a list: result_list
result_list = list(result)

# Convert result into a list and print it
print(result_list)


# Import reduce from functools
from functools import reduce

# Create a list of strings: stark
stark = ['robb', 'sansa', 'arya', 'eddard', 'jon']

# Use reduce() to apply a lambda function over stark: result
result = reduce(lambda item1, item2 : item1 + item2, stark)

# Print the result
print(result)


# Define shout_echo
def shout_echo(word1, echo=1):
    """Concatenate echo copies of word1 and three
    exclamation marks at the end of the string."""

    # Initialize empty strings: echo_word, shout_words
    echo_word = ""
    shout_words = ""

    # Add exception handling with try-except
    try:
        # Concatenate echo copies of word1 using *: echo_word
        echo_word = word1 * echo

        # Concatenate '!!!' to echo_word: shout_words
        shout_words = echo_word + "!!!"
    except:
        # Print error message
        print("word1 must be a string and echo must be an integer.")

    # Return shout_words
    return shout_words

# Call shout_echo
shout_echo("particle", echo="accelerator")


# Define shout_echo
def shout_echo(word1, echo=1):
    """Concatenate echo copies of word1 and three
    exclamation marks at the end of the string."""

    # Raise an error with raise
    if echo < 0:
        raise ValueError('echo must be greater than or equal to 0')

    # Concatenate echo copies of word1 using *: echo_word
    echo_word = word1 * echo

    # Concatenate '!!!' to echo_word: shout_word
    shout_word = echo_word + '!!!'

    # Return shout_word
    return shout_word

# Call shout_echo
shout_echo("particle", echo=5)


# DO NOT HAVE file tweets_df (made "tweets.csv" using R)
import pandas as pd
tweets_df = pd.read_csv("tweets.csv")

# Select retweets from the Twitter DataFrame: result
result = filter(lambda x : x[0:2] == "RT", tweets_df["text"])

# Create list from filter object result: res_list
res_list = list(result)

# Print all retweets in res_list
for tweet in res_list:
    print(tweet)


# Define count_entries()
def count_entries(df, col_name='lang'):
    """Return a dictionary with counts of
    occurrences as value for each key."""

    # Initialize an empty dictionary: cols_count
    cols_count = {}

    # Add try block
    try:
        # Extract column from DataFrame: col
        col = df[col_name]
        
        # Iterate over the column in dataframe
        for entry in col:
    
            # If entry is in cols_count, add 1
            if entry in cols_count.keys():
                cols_count[entry] += 1
            # Else add the entry to cols_count, set the value to 1
            else:
                cols_count[entry] = 1
    
        # Return the cols_count dictionary
        return cols_count

    # Add except block
    except:
        print('The DataFrame does not have a ' + col_name + ' column.')

# DO NOT HAVE file tweets_df
# Call count_entries(): result1
result1 = count_entries(tweets_df, 'lang')

# Print result1
print(result1)

# Call count_entries(): result2
result2 = count_entries(tweets_df, 'lang1')


# Define count_entries()
def count_entries(df, col_name='lang'):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    
    # Raise a ValueError if col_name is NOT in DataFrame
    if col_name not in df.columns:
        raise ValueError('The DataFrame does not have a ' + col_name + ' column.')

    # Initialize an empty dictionary: cols_count
    cols_count = {}
    
    # Extract column from DataFrame: col
    col = df[col_name]
    
    # Iterate over the column in DataFrame
    for entry in col:

        # If entry is in cols_count, add 1
        if entry in cols_count.keys():
            cols_count[entry] += 1
        # Else add the entry to cols_count, set the value to 1
        else:
            cols_count[entry] = 1

    # Return the cols_count dictionary
    return cols_count

# Call count_entries(): result1
result1 = count_entries(tweets_df, "lang")

# Print result1
print(result1)


# CAREFUL, THIS ONE IS DESIGNED TO RAISE THE ERROR!
# count_entries(tweets_df, 'lang1')
## heyheyheyheyhey
## ['protego!!!', 'accio!!!', 'expecto patronum!!!', 'legilimens!!!']
## ['samwise', 'aragorn', 'legolas', 'boromir']
## robbsansaaryaeddardjon
## word1 must be a string and echo must be an integer.
## RT H
## RT F
## RT H
## RT H
## RT G
## RT G
## RT G
## RT E
## RT E
## RT E
## RT H
## RT E
## RT F
## RT G
## RT G
## RT E
## RT G
## RT E
## RT E
## RT H
## RT G
## RT E
## RT G
## RT G
## RT F
## RT H
## RT H
## RT E
## RT H
## RT G
## RT F
## RT F
## RT F
## RT H
## RT G
## RT E
## RT E
## RT H
## RT F
## RT G
## RT H
## RT E
## RT H
## RT G
## RT F
## RT E
## RT F
## RT E
## RT E
## RT H
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
## The DataFrame does not have a lang1 column.
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}

Python Data Science Toolbox (Part II)

Chapter 1 - Using iterators in PythonLand

Introduction to iterators - for loops and the like:

  • For loops can be used for iterating over strings, lists, dictionaries, range() objects, and the like
  • Anything that can be looped over is called an “iterable”; passing it to iter() returns an iterator
  • An “iterator” is something that produces the next value with a next() call
  • Under the hood, a for loop calls iter() on the object and then makes a (silent) next() call for each pass through the loop
    • word = “Da”; it = iter(word); next(it); next(it) # “D” then “a”
  • The “star” (*) operator unpacks all remaining elements of an iterator at once
    • word = “Data”; it = iter(word); print(*it) # single-line of “D” “a” “t” “a”
    • Note that if print(*it) were then called again, there would just be a blank line; there is nothing left to iterate over
  • To unpack a dictionary, use myDict.items()
  • To iterate over a file, use file=open(“file.txt”); it = iter(file); print(next(it))
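
The iterator behavior described above, as a short runnable sketch. The dictionary pairs is an assumed sample; the StopIteration handling shows what a for loop does silently when an iterator runs out.

```python
word = "Da"
it = iter(word)
print(next(it))  # D
print(next(it))  # a

# A further next() call raises StopIteration; a for loop catches this silently
try:
    next(it)
except StopIteration:
    print("iterator exhausted")

# The star operator unpacks everything that is left in the iterator
it = iter("Data")
print(*it)  # D a t a
print(*it)  # blank line, nothing left to iterate over

# Unpacking a dictionary into key, value pairs
pairs = {"a": 1, "b": 2}
for key, value in pairs.items():
    print(key, value)
```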

Playing with iterators - enumerate and zip:

  • The function enumerate(myIterable) returns an “enumerate” class object with both items and their indices
    • Running list() on the “enumerate” class object will make a list out of the tuples [(index1, item1), (index2, item2), . . . ]
    • The default is for index1=0, though the argument start= may be included in the enumerate() call for a different starting index
  • The zip(myIter01, myIter02) will create a “zip” class object of the iterators
    • list(zip()) will return a list of tuples, starting with (allItem1), (allItem2), . . .
    • The iterables need not be the same length, but zip() stops as soon as the shortest one runs out, so any extra items in the longer ones are dropped
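
A small sketch of enumerate() with a custom start index and of zip() with inputs of unequal length. The lists names and ages are made-up sample data.

```python
names = ["ann", "bob", "cy"]
ages = [31, 25]  # deliberately one item shorter than names

# enumerate() pairs each item with an index, here starting from 1
indexed = list(enumerate(names, start=1))
print(indexed)  # [(1, 'ann'), (2, 'bob'), (3, 'cy')]

# zip() stops at the shortest input, so 'cy' is dropped
paired = list(zip(names, ages))
print(paired)  # [('ann', 31), ('bob', 25)]
```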

Using iterators to load large files into memory - loading data in chunks:

  • Common strategy with large files is to read in some data, process it, save the results, discard the data, and then repeat
  • The pandas.read_csv() has an option for chunksize= that allows for reading chunks of any given size
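
The read, process, discard pattern can be sketched without any data file. This is not the pandas approach: it uses only the standard library (csv, io, itertools.islice) on an in-memory string, with a made-up lang column and a chunk size of 2, purely to show the shape of the loop. With pandas, iterating over pd.read_csv(..., chunksize=...) yields one DataFrame per chunk in the same way.

```python
import csv
import io
from itertools import islice

# Stand-in for a large file on disk (assumed sample data)
csv_text = "lang\nen\nfr\nen\nit\nen\n"
reader = csv.DictReader(io.StringIO(csv_text))

counts = {}
while True:
    # Read at most 2 rows; only this chunk is held in memory
    chunk = list(islice(reader, 2))
    if not chunk:
        break
    # Process the chunk and keep only the running totals
    for row in chunk:
        counts[row["lang"]] = counts.get(row["lang"], 0) + 1

print(counts)  # {'en': 3, 'fr': 1, 'it': 1}
```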

Example code includes:


# Create a list of strings: flash
flash = ['jay garrick', 'barry allen', 'wally west', 'bart allen']

# Print each list item in flash using a for loop
for person in flash : print(person)

# Create an iterator for flash: superspeed
superspeed = iter(flash)

# Print each item from the iterator
print(next(superspeed))
print(next(superspeed))
print(next(superspeed))
print(next(superspeed))


# Create an iterator for range(3): small_value
small_value = iter(range(3))

# Print the values in small_value
print(next(small_value))
print(next(small_value))
print(next(small_value))

# Loop over range(3) and print the values
for num in range(3) : print(num)


# Create an iterator for range(10 ** 100): googol
googol = iter(range(10 ** 100))

# Print the first 5 values from googol
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))


# Create a range object: values
values = range(10, 21)

# Print the range object
print(values)

# Create a list of integers: values_list
values_list = list(values)

# Print values_list
print(values_list)

# Get the sum of values: values_sum
values_sum = sum(values)

# Print values_sum
print(values_sum)


# Create a list of strings: mutants
mutants = ['charles xavier', 
            'bobby drake', 
            'kurt wagner', 
            'max eisenhardt', 
            'kitty pride']

# Create a list of tuples: mutant_list
mutant_list = list(enumerate(mutants))

# Print the list of tuples
print(mutant_list)

# Unpack and print the tuple pairs
for index1, value1 in mutant_list :
    print(index1, value1)

# Change the start index
for index2, value2 in list(enumerate(mutants, start=1)) :
    print(index2, value2)


aliases = ['prof x', 'iceman', 'nightcrawler', 'magneto', 'shadowcat']
powers = ['telepathy', 'thermokinesis', 'teleportation', 'magnetokinesis', 'intangibility' ]

# Create a list of tuples: mutant_data
mutant_data = list(zip(mutants, aliases, powers))

# Print the list of tuples
print(mutant_data)

# Create a zip object using the three lists: mutant_zip
mutant_zip = zip(mutants, aliases, powers)

# Print the zip object
print(mutant_zip)

# Unpack the zip object and print the tuple values
for value1, value2, value3 in mutant_zip :
    print(value1, value2, value3)


# Create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)

# Print the tuples in z1 by unpacking with *
print(*z1)

# Re-create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)

# 'Unzip' the tuples in z1 by unpacking with * and zip(): result1, result2
result1, result2 = zip(*z1)

# Check if unpacked tuples are equivalent to original tuples
print(result1 == tuple(mutants))
print(result2 == tuple(powers))


import pandas as pd

# Initialize an empty dictionary: counts_dict
counts_dict = dict()

# DO NOT HAVE FILE tweets.csv
# Created in R - see above for code
# Iterate over the file chunk by chunk
for chunk in pd.read_csv("tweets.csv", chunksize=10):
    # Iterate over the column in DataFrame
    for entry in chunk['lang']:
        if entry in counts_dict.keys():
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1

# Print the populated dictionary
print(counts_dict)


# Define count_entries()
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize=c_size):

        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict.keys():
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    # Return counts_dict
    return counts_dict

# Call count_entries(): result_counts
result_counts = count_entries("tweets.csv", 10, "lang")

# Print result_counts
print(result_counts)
## jay garrick
## barry allen
## wally west
## bart allen
## jay garrick
## barry allen
## wally west
## bart allen
## 0
## 1
## 2
## 0
## 1
## 2
## 0
## 1
## 2
## 3
## 4
## range(10, 21)
## [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
## 165
## [(0, 'charles xavier'), (1, 'bobby drake'), (2, 'kurt wagner'), (3, 'max eisenhardt'), (4, 'kitty pride')]
## 0 charles xavier
## 1 bobby drake
## 2 kurt wagner
## 3 max eisenhardt
## 4 kitty pride
## 1 charles xavier
## 2 bobby drake
## 3 kurt wagner
## 4 max eisenhardt
## 5 kitty pride
## [('charles xavier', 'prof x', 'telepathy'), ('bobby drake', 'iceman', 'thermokinesis'), ('kurt wagner', 'nightcrawler', 'teleportation'), ('max eisenhardt', 'magneto', 'magnetokinesis'), ('kitty pride', 'shadowcat', 'intangibility')]
## <zip object at 0x00AA4170>
## charles xavier prof x telepathy
## bobby drake iceman thermokinesis
## kurt wagner nightcrawler teleportation
## max eisenhardt magneto magnetokinesis
## kitty pride shadowcat intangibility
## ('charles xavier', 'telepathy') ('bobby drake', 'thermokinesis') ('kurt wagner', 'teleportation') ('max eisenhardt', 'magnetokinesis') ('kitty pride', 'intangibility')
## True
## True
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}

Chapter 2 - List comprehensions and generators

List comprehensions help address some of the inefficiencies (coding, run time, etc.) of using for loops for some tasks:

  • The syntax is [myDesiredCalcs for myVar in myIter] # should be square-bracketed
  • List comprehension may be used over any iterable; for example, [num for num in range(11)] will return [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  • List comprehension can also be valuable in lieu of nested for loops; for example, with [(num1, num2) for num1 in range(0, 2) for num2 in range(6, 8)]
  • There is sometimes a trade-off for readability to keep in mind
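
As a small sketch of the nested-loop case mentioned above, the following builds the same list twice, once with explicit for loops and once with a comprehension (the variable names pairs_loop and pairs_comp are illustrative only):

```python
# Nested for loops building a list of pairs
pairs_loop = []
for num1 in range(0, 2):
    for num2 in range(6, 8):
        pairs_loop.append((num1, num2))

# Equivalent list comprehension; the for clauses appear in the same
# order as the nested loops (outer loop first)
pairs_comp = [(num1, num2) for num1 in range(0, 2) for num2 in range(6, 8)]

print(pairs_comp)                # [(0, 6), (0, 7), (1, 6), (1, 7)]
print(pairs_loop == pairs_comp)  # True
```

The comprehension is shorter, but with more than two for clauses the explicit loops are often easier to read.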

Advanced comprehensions - additional functionality available:

  • [myDesiredCalcs for myVar in myIter if myCond] # allows for myCond to limit the myVar that are available to myDesiredCalcs (thus limiting the output list)
  • [myDesiredCalcs if myCond else myDefault for myVar in myIter] # allows for extracting where myCond is met and replacing with myDefault otherwise
  • {myCalc01 : myCalc02 for myVar in myIter} # produces a dictionary with key myCalc01 and value myCalc02
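
The three forms above differ mainly in where the condition sits. A minimal sketch, using a made-up list of words:

```python
# Hypothetical input data for illustration
words = ['alpha', 'be', 'gamma', 'pi']

# Condition AFTER the iterable: filters, so the output may be shorter
long_words = [w for w in words if len(w) > 2]
print(long_words)  # ['alpha', 'gamma']

# Conditional expression BEFORE the for: replaces, so the length is kept
masked = [w if len(w) > 2 else '' for w in words]
print(masked)  # ['alpha', '', 'gamma', '']

# Dict comprehension: braces plus a key : value pair
word_lengths = {w: len(w) for w in words}
print(word_lengths)  # {'alpha': 5, 'be': 2, 'gamma': 5, 'pi': 2}
```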

Introduction to generator expressions - creating generator objects rather than lists/dictionaries:

  • (myDesiredCalcs for myVar in myIter) # will create a generator object rather than a list
  • Using a generator expression can help significantly with large sequences due to “lazy evaluation” (not evaluated until needed, such as next() being called)
    • While [num for num in range(10 ** 1000000)] will bomb out of memory, (num for num in range(10 ** 1000000)) is OK!
  • An additional nice feature is that all the conditionals can be run in the generator expression also
  • Note that using “yield” rather than “return” in a def (function) will build a generator function (it will return a generator object when called)
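
A short sketch of lazy evaluation and of a generator function, under the assumption that the names doubles, evens, and countdown are purely illustrative:

```python
# Generator expression: nothing is computed when this line runs,
# even though the range is far too large to hold in a list
doubles = (num * 2 for num in range(10 ** 100))

# Values are produced one at a time, only when requested
first_two = [next(doubles), next(doubles)]
print(first_two)  # [0, 2]

# Conditions work the same way as in a list comprehension
evens = list(num for num in range(10) if num % 2 == 0)
print(evens)  # [0, 2, 4, 6, 8]


def countdown(n):
    """Generator function: yield n, n-1, ..., 1."""
    while n > 0:
        yield n  # hand back a value, then pause here until the next request
        n -= 1


print(list(countdown(3)))  # [3, 2, 1]
```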

Wrapping up comprehensions and generators - helps with wrangling data:

  • Basic form - enclosed in brackets, output will be a list
  • More advanced forms - conditions on the output expression and/or on the iterable
  • Dictionaries - enclosed in braces
  • Generators - enclosed in parentheses

Example code includes:


doctor = ['house', 'cuddy', 'chase', 'thirteen', 'wilson']
[doc[0] for doc in doctor]

# Create list comprehension: squares
squares = [i ** 2 for i in range(0, 10)]


# Create a 5 x 5 matrix using a list of lists: matrix
matrix = [[col for col in range(5)] for row in range(5)]

# Print the matrix
for row in matrix:
    print(row)


# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create list comprehension: new_fellowship
new_fellowship = [member for member in fellowship if len(member) >= 7]

# Print the new list
print(new_fellowship)


# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create list comprehension: new_fellowship
new_fellowship = [member if len(member) >= 7 else "" for member in fellowship]

# Print the new list
print(new_fellowship)


# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create dict comprehension: new_fellowship
new_fellowship = {member : len(member) for member in fellowship}

# Print the new dictionary
print(new_fellowship)


# Create generator object: result
result = (num for num in range(16))

# Print the first 5 values
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print(next(result))

# Print the rest of the values
# NOTE - only will print 5-15 since 0-4 have previously been "consumed" above
for value in result:
    print(value)


# Create a list of strings: lannister
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Create a generator object: lengths
lengths = (len(person) for person in lannister)

# Iterate over and print the values in lengths
for value in lengths:
    print(value)


# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""
    # Yield the length of a string
    for person in input_list:
        yield len(person)

# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)


# DO NOT HAVE pandas DataFrame "df"
# Extract the created_at column from df: tweet_time
# tweet_time = df["created_at"]

# Extract the clock time: tweet_clock_time
# tweet_clock_time = [entry[11:19] for entry in tweet_time]

# Print the extracted times
# print(tweet_clock_time)


# Extract the created_at column from df: tweet_time
# tweet_time = df['created_at']

# Extract the clock time: tweet_clock_time
# tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == "19"]

# Print the extracted times
# print(tweet_clock_time)
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## ['samwise', 'aragorn', 'legolas', 'boromir']
## ['', 'samwise', '', 'aragorn', 'legolas', 'boromir', '']
## {'frodo': 5, 'samwise': 7, 'merry': 5, 'aragorn': 7, 'legolas': 7, 'boromir': 7, 'gimli': 5}
## 0
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 6
## 5
## 5
## 6
## 7
## 6
## 5
## 5
## 6
## 7

Chapter 3 - Bringing it all together (case study)

Welcome to the case study - applies techniques from the previous two courses:

  • Wrangle and extract data from the World Bank Indicators dataset (1960-2015 data on 227 countries)
  • Recall that zip(a, b, ...) creates an iterator of tuples (a1, b1, …), (a2, b2, …), …

Using Python generators for streaming data:

  • Generators are helpful for reading large files - in fact, they work even on files that are being written (as long as the read stays behind the write)
  • Generator functions are written like regular functions, but they use “yield” (hand back a value, pause, and resume from the same point on the next request) rather than “return” (return the value and stop)
  • Goal will be to write a generator to read streaming data
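
One possible sketch of such a generator is below. It deliberately swaps in the standard-library csv.reader instead of splitting each line on ',' by hand, because a plain split breaks quoted fields that contain commas (this is why keys such as '"Bahamas' show up in the counts printed later in this chapter). The sample text and the name read_rows are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def read_rows(file_object):
    """Yield one parsed CSV row at a time from an open file object."""
    # csv.reader is lazy: it pulls lines from file_object only on demand
    for row in csv.reader(file_object):
        yield row


# Hypothetical stand-in for an open file; note the quoted field with a comma
sample = io.StringIO(
    'CountryName,CountryCode,Year\n'
    '"Bahamas, The",BHS,1960\n'
    '"Bahamas, The",BHS,1961\n'
    'Bahrain,BHR,1960\n'
)

rows = read_rows(sample)
header = next(rows)  # consume the header row first

# Count rows per country without ever holding the whole file in memory
counts = Counter(row[0] for row in rows)
print(counts)
```

With a real file, the csv module documentation recommends opening it with newline='' before passing it to csv.reader.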

Reading files in chunks with pandas.read_csv():

  • pandas.read_csv(file, chunksize= ) # allows the file to be read in chunks of size chunksize
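
A minimal sketch of filtering each chunk and combining the pieces at the end. The CSV text is invented for illustration (the column names mirror those used in this chapter), and pd.concat is used to combine the pieces because DataFrame.append was removed in pandas 2.0.

```python
import io

import pandas as pd

# Hypothetical stand-in for a population file on disk
csv_text = (
    "CountryCode,Year,Total Population,Urban population (% of total)\n"
    "ARB,1960,1000,30.0\n"
    "CEB,1960,2000,50.0\n"
    "ARB,1961,1100,31.0\n"
    "CEB,1961,2100,52.0\n"
)

pieces = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Keep only the rows of interest; copy so the new column is safe to add
    ceb = chunk[chunk["CountryCode"] == "CEB"].copy()
    ceb["Total Urban Population"] = (
        0.01 * ceb["Total Population"] * ceb["Urban population (% of total)"]
    ).astype(int)
    pieces.append(ceb)

# Combine the filtered chunks once, after the loop
data = pd.concat(pieces, ignore_index=True)
print(data)
```

Collecting the pieces in a list and concatenating once avoids repeatedly copying a growing DataFrame inside the loop.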

Example code includes:


row_vals = [ 'Arab World', 'ARB', 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'SP.ADO.TFRT', '1960', '133.56090740552298' ]

feature_names = [ 'CountryName', 'CountryCode', 'IndicatorName', 'IndicatorCode', 'Year', 'Value' ]

# Zip lists: zipped_lists
zipped_lists = zip(feature_names, row_vals)

# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)

# Print the dictionary
print(rs_dict)


# Define lists2dict()
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""
    
    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)
    
    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)
    
    # Return the dictionary
    return rs_dict

# Call lists2dict: rs_fxn
rs_fxn = lists2dict(feature_names, row_vals)

# Print rs_fxn
print(rs_fxn)


# Create list row_lists
regn = ['Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World']

abb = ['ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB']

indName = ['Adolescent fertility rate (births per 1,000 women ages 15-19)', 'Age dependency ratio (% of working-age population)', 'Age dependency ratio, old (% of working-age population)', 'Age dependency ratio, young (% of working-age population)', 'Arms exports (SIPRI trend indicator values)', 'Arms imports (SIPRI trend indicator values)', 'Birth rate, crude (per 1,000 people)', 'CO2 emissions (kt)', 'CO2 emissions (metric tons per capita)', 'CO2 emissions from gaseous fuel consumption (% of total)', 'CO2 emissions from liquid fuel consumption (% of total)', 'CO2 emissions from liquid fuel consumption (kt)', 'CO2 emissions from solid fuel consumption (% of total)', 'Death rate, crude (per 1,000 people)', 'Fertility rate, total (births per woman)', 'Fixed telephone subscriptions', 'Fixed telephone subscriptions (per 100 people)', 'Hospital beds (per 1,000 people)', 'International migrant stock (% of population)', 'International migrant stock, total' ]

indCode = ['SP.ADO.TFRT', 'SP.POP.DPND', 'SP.POP.DPND.OL', 'SP.POP.DPND.YG', 'MS.MIL.XPRT.KD', 'MS.MIL.MPRT.KD', 'SP.DYN.CBRT.IN', 'EN.ATM.CO2E.KT', 'EN.ATM.CO2E.PC', 'EN.ATM.CO2E.GF.ZS', 'EN.ATM.CO2E.LF.ZS', 'EN.ATM.CO2E.LF.KT', 'EN.ATM.CO2E.SF.ZS', 'SP.DYN.CDRT.IN', 'SP.DYN.TFRT.IN', 'IT.MLT.MAIN', 'IT.MLT.MAIN.P2', 'SH.MED.BEDS.ZS', 'SM.POP.TOTL.ZS', 'SM.POP.TOTL']

year = ['1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960']

value = ['133.56090740552298', '87.7976011532547', '6.634579191565161', '81.02332950839141', '3000000.0', '538000000.0', '47.697888095096395', '59563.9892169935', '0.6439635478877049', '5.041291753975099', '84.8514729446567', '49541.707291032304', '4.72698138789597', '19.7544519237187', '6.92402738655897', '406833.0', '0.6167005703199', '1.9296220724398703', '2.9906371279862403', '3324685.0']

row_lists=list(zip(regn, abb, indName, indCode, year, value))

# Print the first two lists in row_lists
print(row_lists[0])
print(row_lists[1])

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])

# Import the pandas package
import pandas as pd

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Turn list of dicts into a DataFrame: df
df = pd.DataFrame(list_of_dicts)

# Print the head of the DataFrame
print(df.head())

# REFERENCE DATA POSSIBLY AT http://data.worldbank.org/data-catalog/world-development-indicators
# Created relevant file "world_dev_ind.csv" using Python and World Bank download
# Open a connection to the file
with open("world_dev_ind.csv") as file:
    
    # Skip the column names
    file.readline()
    
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}
    
    # Process only the first 1000 rows
    for j in range(1000):
        
        # Split the current line into a list: line
        line = file.readline().split(',')
        
        # Get the value for the first column: first_col
        first_col = line[0]
        
        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        
        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)

# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""
    
    # Loop indefinitely until the end of the file
    while True:
        
        # Read a line from the file: data
        data = file_object.readline()
        
        # Break if this is the end of the file
        if not data:
            break
        
        # Yield the line of data
        yield data
        
# Open a connection to the file
with open('world_dev_ind.csv') as file:
    
    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)
    
    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))


# Initialize an empty dictionary: counts_dict
counts_dict = {}


# Open a connection to the file
with open("world_dev_ind.csv") as file:
    
    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):
        row = line.split(',')
        first_col = row[0]
        
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)


# DO NOT HAVE FILE ind_pop.csv (CountryName,CountryCode,IndicatorName,IndicatorCode,Year,Value\n)
# Value for regions of CountryName/CountryCode - fixing Urban population (% of total), SP.URB.TOTL.IN.ZS , 1960
# Just changed it to use "world_dev_ind.csv"
# Import the pandas package
import pandas as pd
import matplotlib.pyplot as plt


# Initialize reader object: df_reader
df_reader = pd.read_csv("world_dev_ind.csv", chunksize=10)

# Print two chunks
print(next(df_reader))
print(next(df_reader))


# DO NOT HAVE FILE ind_pop_data.csv 
# (CountryName,CountryCode,Year,Total Population,Urban population (% of total)\n)
# Appears to be 1960-1964
# Initialize reader object: urb_pop_reader
# Create file using Python, needs to read in using encoding="latin-1"
urb_pop_reader = pd.read_csv("ind_pop_data.csv", chunksize=2500, encoding="latin-1")

# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)

# Check out the head of the DataFrame
print(df_urb_pop.head())

# Check out specific country: df_pop_ceb
idxCeb = df_urb_pop[df_urb_pop["CountryCode"] == "CEB"].index
df_pop_ceb = df_urb_pop.loc[idxCeb, :]  # Make sure it is not just a reference . . . 

# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb["Total Population"], df_pop_ceb["Urban population (% of total)"])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Print pops_list
print(pops_list)


# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv("ind_pop_data.csv", chunksize=2500, encoding="latin-1")

# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)

# Check out specific country: df_pop_ceb
idxCeb = df_urb_pop[df_urb_pop["CountryCode"] == "CEB"].index
df_pop_ceb = df_urb_pop.loc[idxCeb, :].copy()  # Explicit copy, not just a reference to df_urb_pop
# df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'], 
            df_pop_ceb['Urban population (% of total)'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Use list comprehension to create new DataFrame column 'Total Urban Population'
# df_pop_ceb["Total Urban Population"] = df_pop_ceb["Total Population"]
# a = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]
df_pop_ceb['Total Urban Population'] = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]


# Plot urban population data
df_pop_ceb.plot(kind="scatter", x="Year", y="Total Urban Population")
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy020.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000, encoding="latin-1")

# Initialize empty DataFrame: data
data = pd.DataFrame()

# Iterate over each DataFrame chunk
for df_urb_pop in urb_pop_reader:

    # Check out specific country: df_pop_ceb
    idxCeb = df_urb_pop[df_urb_pop["CountryCode"] == "CEB"].index
    df_pop_ceb = df_urb_pop.loc[idxCeb, :].copy()  # Explicit copy, not just a reference to df_urb_pop

    # Zip DataFrame columns of interest: pops
    pops = zip(df_pop_ceb['Total Population'],
                df_pop_ceb['Urban population (% of total)'])

    # Turn zip object into list: pops_list
    pops_list = list(pops)

    # Use list comprehension to create new DataFrame column 'Total Urban Population'
    # df_pop_ceb["Total Urban Population"] = df_pop_ceb["Total Population"]
    # a = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]
    df_pop_ceb['Total Urban Population'] = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]
    
    # Append DataFrame chunk to data: data
    # (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
    data = pd.concat([data, df_pop_ceb])

# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy021.png", bbox_inches="tight")
plt.clf()  # Required to prevent continued over-plotting


# Define plot_pop()
def plot_pop(filename, country_code, pngCode=False):
    
    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000, encoding="latin-1")
    
    # Initialize empty DataFrame: data
    data = pd.DataFrame()
    
    # Iterate over each DataFrame chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country: df_pop_ceb
        idxCeb = df_urb_pop[df_urb_pop["CountryCode"] == country_code].index
        df_pop_ceb = df_urb_pop.loc[idxCeb, :].copy()  # Explicit copy, not just a reference to df_urb_pop
        
        # Zip DataFrame columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                    df_pop_ceb['Urban population (% of total)'])
        
        # Turn zip object into list: pops_list
        pops_list = list(pops)
        
        # Use list comprehension to create new DataFrame column 'Total Urban Population'
        # df_pop_ceb["Total Urban Population"] = df_pop_ceb["Total Population"]
        # a = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]
        # df_pop_ceb.loc[df_pop_ceb.index, 'Total Urban Population'] = a
        df_pop_ceb['Total Urban Population'] = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]
        
        # Append DataFrame chunk to data: data
        # (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
        data = pd.concat([data, df_pop_ceb])
        
    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    if not pngCode:
        plt.show()  # Plot by default
    else:
        plt.savefig(pngCode, bbox_inches="tight")  # Save as dummy PNG instead
    
    plt.clf()  # Required to prevent continued over-plotting

# Set the filename: fn
fn = 'ind_pop_data.csv'

# Call plot_pop for country code 'CEB'
plot_pop(fn, "CEB", "_dummyPy022.png")

# Call plot_pop for country code 'ARB'
plot_pop(fn, "ARB", "_dummyPy023.png")
## {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
## {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
## ('Arab World', 'ARB', 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'SP.ADO.TFRT', '1960', '133.56090740552298')
## ('Arab World', 'ARB', 'Age dependency ratio (% of working-age population)', 'SP.POP.DPND', '1960', '87.7976011532547')
## {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
## {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Age dependency ratio (% of working-age population)', 'IndicatorCode': 'SP.POP.DPND', 'Year': '1960', 'Value': '87.7976011532547'}
##   CountryCode CountryName   IndicatorCode  \
## 0         ARB  Arab World     SP.ADO.TFRT   
## 1         ARB  Arab World     SP.POP.DPND   
## 2         ARB  Arab World  SP.POP.DPND.OL   
## 3         ARB  Arab World  SP.POP.DPND.YG   
## 4         ARB  Arab World  MS.MIL.XPRT.KD   
## 
##                                        IndicatorName               Value  Year  
## 0  Adolescent fertility rate (births per 1,000 wo...  133.56090740552298  1960  
## 1  Age dependency ratio (% of working-age populat...    87.7976011532547  1960  
## 2  Age dependency ratio, old (% of working-age po...   6.634579191565161  1960  
## 3  Age dependency ratio, young (% of working-age ...   81.02332950839141  1960  
## 4        Arms exports (SIPRI trend indicator values)           3000000.0  1960  
## {'Arab World': 6, 'Caribbean small states': 6, 'Central Europe and the Baltics': 6, 'Early-demographic dividend': 6, 'East Asia & Pacific': 6, 'East Asia & Pacific (excluding high income)': 6, 'East Asia & Pacific (IDA & IBRD countries)': 6, 'Euro area': 6, 'Europe & Central Asia': 6, 'Europe & Central Asia (excluding high income)': 6, 'Europe & Central Asia (IDA & IBRD countries)': 6, 'European Union': 6, 'Fragile and conflict affected situations': 6, 'Heavily indebted poor countries (HIPC)': 6, 'High income': 6, 'IBRD only': 6, 'IDA & IBRD total': 6, 'IDA blend': 6, 'IDA only': 6, 'IDA total': 6, 'Late-demographic dividend': 6, 'Latin America & Caribbean': 6, 'Latin America & Caribbean (excluding high income)': 6, 'Latin America & the Caribbean (IDA & IBRD countries)': 6, 'Least developed countries: UN classification': 6, 'Low & middle income': 6, 'Low income': 6, 'Lower middle income': 6, 'Middle East & North Africa': 6, 'Middle East & North Africa (excluding high income)': 6, 'Middle East & North Africa (IDA & IBRD countries)': 6, 'Middle income': 6, 'North America': 6, 'OECD members': 6, 'Other small states': 6, 'Pacific island small states': 6, 'Post-demographic dividend': 6, 'Pre-demographic dividend': 6, 'Small states': 6, 'South Asia': 6, 'South Asia (IDA & IBRD)': 6, 'Sub-Saharan Africa': 6, 'Sub-Saharan Africa (excluding high income)': 6, 'Sub-Saharan Africa (IDA & IBRD countries)': 6, 'Upper middle income': 6, 'World': 6, 'Afghanistan': 6, 'Albania': 6, 'Algeria': 6, 'American Samoa': 6, 'Andorra': 6, 'Angola': 6, 'Antigua and Barbuda': 6, 'Argentina': 6, 'Armenia': 6, 'Aruba': 6, 'Australia': 6, 'Austria': 6, 'Azerbaijan': 6, '"Bahamas': 6, 'Bahrain': 6, 'Bangladesh': 6, 'Barbados': 6, 'Belarus': 6, 'Belgium': 6, 'Belize': 6, 'Benin': 6, 'Bermuda': 6, 'Bhutan': 6, 'Bolivia': 6, 'Bosnia and Herzegovina': 6, 'Botswana': 6, 'Brazil': 6, 'British Virgin Islands': 6, 'Brunei Darussalam': 3, 'Bulgaria': 3, 'Burkina Faso': 3, 'Burundi': 3, 'Cabo Verde': 3, 
'Cambodia': 3, 'Cameroon': 3, 'Canada': 3, 'Cayman Islands': 3, 'Central African Republic': 3, 'Chad': 3, 'Channel Islands': 3, 'Chile': 3, 'China': 3, 'Colombia': 3, 'Comoros': 3, '"Congo': 6, 'Costa Rica': 3, "Cote d'Ivoire": 3, 'Croatia': 3, 'Cuba': 3, 'Curacao': 3, 'Cyprus': 3, 'Czech Republic': 3, 'Denmark': 3, 'Djibouti': 3, 'Dominica': 3, 'Dominican Republic': 3, 'Ecuador': 3, '"Egypt': 3, 'El Salvador': 3, 'Equatorial Guinea': 3, 'Eritrea': 3, 'Estonia': 3, 'Ethiopia': 3, 'Faroe Islands': 3, 'Fiji': 3, 'Finland': 3, 'France': 3, 'French Polynesia': 3, 'Gabon': 3, '"Gambia': 3, 'Georgia': 3, 'Germany': 3, 'Ghana': 3, 'Gibraltar': 3, 'Greece': 3, 'Greenland': 3, 'Grenada': 3, 'Guam': 3, 'Guatemala': 3, 'Guinea': 3, 'Guinea-Bissau': 3, 'Guyana': 3, 'Haiti': 3, 'Honduras': 3, '"Hong Kong SAR': 3, 'Hungary': 3, 'Iceland': 3, 'India': 3, 'Indonesia': 3, '"Iran': 3, 'Iraq': 3, 'Ireland': 3, 'Isle of Man': 3, 'Israel': 3, 'Italy': 3, 'Jamaica': 3, 'Japan': 3, 'Jordan': 3, 'Kazakhstan': 3, 'Kenya': 3, 'Kiribati': 3, '"Korea': 6, 'Kosovo': 1, 'Kuwait': 3, 'Kyrgyz Republic': 3, 'Lao PDR': 3, 'Latvia': 3, 'Lebanon': 3, 'Lesotho': 3, 'Liberia': 3, 'Libya': 3, 'Liechtenstein': 3, 'Lithuania': 3, 'Luxembourg': 3, '"Macao SAR': 3, '"Macedonia': 3, 'Madagascar': 3, 'Malawi': 3, 'Malaysia': 3, 'Maldives': 3, 'Mali': 3, 'Malta': 3, 'Marshall Islands': 3, 'Mauritania': 3, 'Mauritius': 3, 'Mexico': 3, '"Micronesia': 3, 'Moldova': 3, 'Monaco': 3, 'Mongolia': 3, 'Montenegro': 3, 'Morocco': 3, 'Mozambique': 3, 'Myanmar': 3, 'Namibia': 3, 'Nauru': 3, 'Nepal': 3, 'Netherlands': 3, 'New Caledonia': 3, 'New Zealand': 3, 'Nicaragua': 3, 'Niger': 3, 'Nigeria': 3, 'Northern Mariana Islands': 3, 'Norway': 3, 'Oman': 3, 'Pakistan': 3, 'Palau': 3, 'Panama': 3, 'Papua New Guinea': 3, 'Paraguay': 3, 'Peru': 3, 'Philippines': 3, 'Poland': 3, 'Portugal': 3, 'Puerto Rico': 3, 'Qatar': 3, 'Romania': 3, 'Russian Federation': 3, 'Rwanda': 3, 'Samoa': 3, 'San Marino': 3, 'Sao Tome and Principe': 3, 
'Saudi Arabia': 3, 'Senegal': 3, 'Seychelles': 3, 'Sierra Leone': 3, 'Singapore': 3, 'Sint Maarten (Dutch part)': 2, 'Slovak Republic': 3, 'Slovenia': 3, 'Solomon Islands': 3, 'Somalia': 3, 'South Africa': 3, 'South Sudan': 3, 'Spain': 3, 'Sri Lanka': 3, 'St. Kitts and Nevis': 3, 'St. Lucia': 3, 'St. Martin (French part)': 1, 'St. Vincent and the Grenadines': 3, 'Sudan': 3, 'Suriname': 3, 'Swaziland': 3, 'Sweden': 3, 'Switzerland': 3, 'Syrian Arab Republic': 3, 'Tajikistan': 3, 'Tanzania': 3, 'Thailand': 3, 'Timor-Leste': 3, 'Togo': 3, 'Tonga': 3, 'Trinidad and Tobago': 3, 'Tunisia': 3, 'Turkey': 3, 'Turkmenistan': 3, 'Turks and Caicos Islands': 3, 'Tuvalu': 3, 'Uganda': 3, 'Ukraine': 3, 'United Arab Emirates': 3, 'United Kingdom': 3, 'United States': 3, 'Uruguay': 3, 'Uzbekistan': 3, 'Vanuatu': 3, '"Venezuela': 3, 'Vietnam': 3, 'Virgin Islands (U.S.)': 3, '"Yemen': 3, 'Zambia': 3, 'Zimbabwe': 3}
## Country Name,Country Code,Indicator Name,Indicator Code,year,value
## 
## Arab World,ARB,"Population, total",SP.POP.TOTL,1960,92496099.0
## 
## Arab World,ARB,Rural population (% of total population),SP.RUR.TOTL.ZS,1960,68.7081520885329
## 
## {'Country Name': 1, 'Arab World': 168, 'Caribbean small states': 168, 'Central Europe and the Baltics': 168, 'Early-demographic dividend': 168, 'East Asia & Pacific': 168, 'East Asia & Pacific (excluding high income)': 168, 'East Asia & Pacific (IDA & IBRD countries)': 168, 'Euro area': 168, 'Europe & Central Asia': 168, 'Europe & Central Asia (excluding high income)': 168, 'Europe & Central Asia (IDA & IBRD countries)': 168, 'European Union': 168, 'Fragile and conflict affected situations': 168, 'Heavily indebted poor countries (HIPC)': 168, 'High income': 168, 'IBRD only': 168, 'IDA & IBRD total': 168, 'IDA blend': 168, 'IDA only': 168, 'IDA total': 168, 'Late-demographic dividend': 168, 'Latin America & Caribbean': 168, 'Latin America & Caribbean (excluding high income)': 168, 'Latin America & the Caribbean (IDA & IBRD countries)': 168, 'Least developed countries: UN classification': 168, 'Low & middle income': 168, 'Low income': 168, 'Lower middle income': 168, 'Middle East & North Africa': 168, 'Middle East & North Africa (excluding high income)': 168, 'Middle East & North Africa (IDA & IBRD countries)': 168, 'Middle income': 168, 'North America': 168, 'OECD members': 168, 'Other small states': 168, 'Pacific island small states': 168, 'Post-demographic dividend': 168, 'Pre-demographic dividend': 168, 'Small states': 168, 'South Asia': 168, 'South Asia (IDA & IBRD)': 168, 'Sub-Saharan Africa': 168, 'Sub-Saharan Africa (excluding high income)': 168, 'Sub-Saharan Africa (IDA & IBRD countries)': 168, 'Upper middle income': 168, 'World': 168, 'Afghanistan': 168, 'Albania': 168, 'Algeria': 168, 'American Samoa': 168, 'Andorra': 168, 'Angola': 168, 'Antigua and Barbuda': 168, 'Argentina': 168, 'Armenia': 168, 'Aruba': 168, 'Australia': 168, 'Austria': 168, 'Azerbaijan': 168, '"Bahamas': 168, 'Bahrain': 168, 'Bangladesh': 168, 'Barbados': 168, 'Belarus': 168, 'Belgium': 168, 'Belize': 168, 'Benin': 168, 'Bermuda': 168, 'Bhutan': 168, 'Bolivia': 168, 'Bosnia and 
Herzegovina': 168, 'Botswana': 168, 'Brazil': 168, 'British Virgin Islands': 168, 'Brunei Darussalam': 168, 'Bulgaria': 168, 'Burkina Faso': 168, 'Burundi': 168, 'Cabo Verde': 168, 'Cambodia': 168, 'Cameroon': 168, 'Canada': 168, 'Cayman Islands': 168, 'Central African Republic': 168, 'Chad': 168, 'Channel Islands': 168, 'Chile': 168, 'China': 168, 'Colombia': 168, 'Comoros': 168, '"Congo': 336, 'Costa Rica': 168, "Cote d'Ivoire": 168, 'Croatia': 168, 'Cuba': 168, 'Curacao': 168, 'Cyprus': 168, 'Czech Republic': 168, 'Denmark': 168, 'Djibouti': 168, 'Dominica': 168, 'Dominican Republic': 168, 'Ecuador': 168, '"Egypt': 168, 'El Salvador': 168, 'Equatorial Guinea': 168, 'Eritrea': 156, 'Estonia': 168, 'Ethiopia': 168, 'Faroe Islands': 168, 'Fiji': 168, 'Finland': 168, 'France': 168, 'French Polynesia': 168, 'Gabon': 168, '"Gambia': 168, 'Georgia': 168, 'Germany': 168, 'Ghana': 168, 'Gibraltar': 168, 'Greece': 168, 'Greenland': 168, 'Grenada': 168, 'Guam': 168, 'Guatemala': 168, 'Guinea': 168, 'Guinea-Bissau': 168, 'Guyana': 168, 'Haiti': 168, 'Honduras': 168, '"Hong Kong SAR': 168, 'Hungary': 168, 'Iceland': 168, 'India': 168, 'Indonesia': 168, '"Iran': 168, 'Iraq': 168, 'Ireland': 168, 'Isle of Man': 168, 'Israel': 168, 'Italy': 168, 'Jamaica': 168, 'Japan': 168, 'Jordan': 168, 'Kazakhstan': 168, 'Kenya': 168, 'Kiribati': 168, '"Korea': 336, 'Kosovo': 56, 'Kuwait': 165, 'Kyrgyz Republic': 168, 'Lao PDR': 168, 'Latvia': 168, 'Lebanon': 168, 'Lesotho': 168, 'Liberia': 168, 'Libya': 168, 'Liechtenstein': 168, 'Lithuania': 168, 'Luxembourg': 168, '"Macao SAR': 168, '"Macedonia': 168, 'Madagascar': 168, 'Malawi': 168, 'Malaysia': 168, 'Maldives': 168, 'Mali': 168, 'Malta': 168, 'Marshall Islands': 168, 'Mauritania': 168, 'Mauritius': 168, 'Mexico': 168, '"Micronesia': 168, 'Moldova': 168, 'Monaco': 168, 'Mongolia': 168, 'Montenegro': 168, 'Morocco': 168, 'Mozambique': 168, 'Myanmar': 168, 'Namibia': 168, 'Nauru': 168, 'Nepal': 168, 'Netherlands': 168, 'New Caledonia': 
168, 'New Zealand': 168, 'Nicaragua': 168, 'Niger': 168, 'Nigeria': 168, 'Northern Mariana Islands': 168, 'Norway': 168, 'Oman': 168, 'Pakistan': 168, 'Palau': 168, 'Panama': 168, 'Papua New Guinea': 168, 'Paraguay': 168, 'Peru': 168, 'Philippines': 168, 'Poland': 168, 'Portugal': 168, 'Puerto Rico': 168, 'Qatar': 168, 'Romania': 168, 'Russian Federation': 168, 'Rwanda': 168, 'Samoa': 168, 'San Marino': 168, 'Sao Tome and Principe': 168, 'Saudi Arabia': 168, 'Senegal': 168, 'Seychelles': 168, 'Sierra Leone': 168, 'Singapore': 168, 'Sint Maarten (Dutch part)': 130, 'Slovak Republic': 168, 'Slovenia': 168, 'Solomon Islands': 168, 'Somalia': 168, 'South Africa': 168, 'South Sudan': 168, 'Spain': 168, 'Sri Lanka': 168, 'St. Kitts and Nevis': 168, 'St. Lucia': 168, 'St. Martin (French part)': 56, 'St. Vincent and the Grenadines': 168, 'Sudan': 168, 'Suriname': 168, 'Swaziland': 168, 'Sweden': 168, 'Switzerland': 168, 'Syrian Arab Republic': 168, 'Tajikistan': 168, 'Tanzania': 168, 'Thailand': 168, 'Timor-Leste': 168, 'Togo': 168, 'Tonga': 168, 'Trinidad and Tobago': 168, 'Tunisia': 168, 'Turkey': 168, 'Turkmenistan': 168, 'Turks and Caicos Islands': 168, 'Tuvalu': 168, 'Uganda': 168, 'Ukraine': 168, 'United Arab Emirates': 168, 'United Kingdom': 168, 'United States': 168, 'Uruguay': 168, 'Uzbekistan': 168, 'Vanuatu': 168, '"Venezuela': 168, 'Vietnam': 168, 'Virgin Islands (U.S.)': 168, '"Yemen': 168, 'Zambia': 168, 'Zimbabwe': 168, 'Serbia': 78, 'West Bank and Gaza': 78}
##                      Country Name Country Code  \
## 0                      Arab World          ARB   
## 1                      Arab World          ARB   
## 2                      Arab World          ARB   
## 3          Caribbean small states          CSS   
## 4          Caribbean small states          CSS   
## 5          Caribbean small states          CSS   
## 6  Central Europe and the Baltics          CEB   
## 7  Central Europe and the Baltics          CEB   
## 8  Central Europe and the Baltics          CEB   
## 9      Early-demographic dividend          EAR   
## 
##                              Indicator Name     Indicator Code  year  \
## 0                         Population, total        SP.POP.TOTL  1960   
## 1  Rural population (% of total population)     SP.RUR.TOTL.ZS  1960   
## 2             Urban population (% of total)  SP.URB.TOTL.IN.ZS  1960   
## 3                         Population, total        SP.POP.TOTL  1960   
## 4  Rural population (% of total population)     SP.RUR.TOTL.ZS  1960   
## 5             Urban population (% of total)  SP.URB.TOTL.IN.ZS  1960   
## 6                         Population, total        SP.POP.TOTL  1960   
## 7  Rural population (% of total population)     SP.RUR.TOTL.ZS  1960   
## 8             Urban population (% of total)  SP.URB.TOTL.IN.ZS  1960   
## 9                         Population, total        SP.POP.TOTL  1960   
## 
##           value  
## 0  9.249610e+07  
## 1  6.870815e+01  
## 2  3.129185e+01  
## 3  4.192721e+06  
## 4  6.840152e+01  
## 5  3.159848e+01  
## 6  9.140158e+07  
## 7  5.549208e+01  
## 8  4.450792e+01  
## 9  9.800680e+08  
##                                    Country Name Country Code  \
## 10                   Early-demographic dividend          EAR   
## 11                   Early-demographic dividend          EAR   
## 12                          East Asia & Pacific          EAS   
## 13                          East Asia & Pacific          EAS   
## 14                          East Asia & Pacific          EAS   
## 15  East Asia & Pacific (excluding high income)          EAP   
## 16  East Asia & Pacific (excluding high income)          EAP   
## 17  East Asia & Pacific (excluding high income)          EAP   
## 18   East Asia & Pacific (IDA & IBRD countries)          TEA   
## 19   East Asia & Pacific (IDA & IBRD countries)          TEA   
## 
##                               Indicator Name     Indicator Code  year  \
## 10  Rural population (% of total population)     SP.RUR.TOTL.ZS  1960   
## 11             Urban population (% of total)  SP.URB.TOTL.IN.ZS  1960   
## 12                         Population, total        SP.POP.TOTL  1960   
## 13  Rural population (% of total population)     SP.RUR.TOTL.ZS  1960   
## 14             Urban population (% of total)  SP.URB.TOTL.IN.ZS  1960   
## 15                         Population, total        SP.POP.TOTL  1960   
## 16  Rural population (% of total population)     SP.RUR.TOTL.ZS  1960   
## 17             Urban population (% of total)  SP.URB.TOTL.IN.ZS  1960   
## 18                         Population, total        SP.POP.TOTL  1960   
## 19  Rural population (% of total population)     SP.RUR.TOTL.ZS  1960   
## 
##            value  
## 10  7.705007e+01  
## 11  2.294993e+01  
## 12  1.042480e+09  
## 13  7.752853e+01  
## 14  2.247147e+01  
## 15  8.964930e+08  
## 16  8.308232e+01  
## 17  1.691768e+01  
## 18  8.850532e+08  
## 19  8.338348e+01  
##    CountryName CountryCode  Year  Total Population  \
## 0  Afghanistan         AFG  1960         8994793.0   
## 1  Afghanistan         AFG  1961         9164945.0   
## 2  Afghanistan         AFG  1962         9343772.0   
## 3  Afghanistan         AFG  1963         9531555.0   
## 4  Afghanistan         AFG  1964         9728645.0   
## 
##    Urban population (% of total)  
## 0                          8.221  
## 1                          8.508  
## 2                          8.805  
## 3                          9.110  
## 4                          9.426  
## [(91401583.0, 44.507921139002597), (92237118.0, 45.206665319194002), (93014890.0, 45.866564696018003), (93845749.0, 46.5340927663649), (94722599.0, 47.208742980352604), (95447065.0, 47.8803084429574), (96148635.0, 48.505097191759397), (97043587.0, 49.067767135854098), (97882394.0, 49.638696249807701), (98602140.0, 50.215657693321887), (99133296.0, 50.780409860456999), (99638983.0, 51.429566445052899), (100363597.0, 52.162105936757101), (101120519.0, 52.894471471541799), (101946256.0, 53.627174447338199), (102862489.0, 54.349653085382698), (103770134.0, 55.061127012228795), (104589313.0, 55.7886862473798), (105304312.0, 56.530668389657201), (105924838.0, 57.213134522150497), (106564905.0, 57.822931161135998), (107187982.0, 58.286690506739795), (107770794.0, 58.683563322897996), (108326895.0, 59.081567030459006), (108853181.0, 59.480212620603197), (109360296.0, 59.873735202774107), (109847148.0, 60.258086533789701), (110296680.0, 60.638613615994608), (110688533.0, 61.020488525916214), (110801380.0, 61.312199620203295), (110745760.0, 61.520994481657802), (110290445.0, 61.741539221625203), (110005636.0, 61.820287894431203), (110081461.0, 61.779410244518786), (110019570.0, 61.751130812191001), (109913216.0, 61.715962603505297), (109563097.0, 61.695812920403299), (109459093.0, 61.661656630381501), (109207205.0, 61.632890822478196), (109092730.0, 61.595078134840001), (108405522.0, 61.567439264127209), (107800399.0, 61.571020971323101), (107097577.0, 61.629953932533901), (106760768.0, 61.670694008505109), (106466116.0, 61.711606934595004), (106173766.0, 61.7605094345057), (105901322.0, 61.815522500550095), (105504531.0, 61.887634194914291), (105126686.0, 61.964992935380899), (104924372.0, 62.020159705764101), (104543801.0, 62.059416833265885), (104174038.0, 62.099517041078904), (103935318.0, 62.141847349338995), (103713726.0, 62.197640397588302), (103496179.0, 62.269282909458894), (103256779.0, 62.357002383678797)]

Plots 20 and 21 are not displayed as they are redundant with plot 22.

Urban Population by Year for Country Code CEB:

Urban Population by Year for Country Code ARB:

Network Analysis in Python (Part I)

Chapter 1 - Introduction to Networks

Introduction to networks - examples like social networks, transportation networks, etc.:

  • Networks are a useful tool for modeling relationships between entities
  • Networks are defined by two sets: nodes and edges (together these form a network, known in mathematics as a “graph”)
  • The “networkx” library is frequently imported as nx
  • The start of an empty network (Graph) can be defined as G = nx.Graph()
  • Nodes can be added using G.add_nodes_from([nodeList])
    • The call to G.nodes() provides a list of the nodes currently in the Graph
  • The call to G.add_edge(u, v) will create a link (edge) between nodes u and v; an edge stored as a tuple can be unpacked with G.add_edge(*myTuple)
    • The call to G.edges() will return all the edges currently defined, each as a (u, v) tuple
  • Metadata can further be added to the nodes, such as using G.node[1][“label”] = “blue”
    • The call to G.nodes(data=True) will then bring back the nodes and also the associated metadata (as dictionaries)
  • The nx.draw() function will draw out the Graph (requires plt.show() where plt is matplotlib.pyplot)
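
A minimal sketch of the calls above, assuming the networkx package is installed. It sets the node metadata through keyword arguments to add_node() rather than through G.node[...], since the attribute-access spelling differs across NetworkX releases; the node numbers and the "label" value are illustrative only.

```python
import networkx as nx

# Start from an empty undirected graph
G = nx.Graph()

# Add several nodes at once, then a single edge between nodes 1 and 2
G.add_nodes_from([1, 2, 3])
G.add_edge(1, 2)

# Re-adding an existing node with keyword arguments attaches metadata to it
G.add_node(1, label="blue")

print(list(G.nodes()))           # node identifiers
print(list(G.edges()))           # edges as (u, v) tuples
print(list(G.nodes(data=True)))  # (node, metadata dictionary) pairs
```

Wrapping the results in list() keeps the printout readable whether the installed version returns plain lists or view objects.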

Types of graphs:

  • Undirected graphs (e.g., Facebook) are typically drawn as a line with no arrows between two circles
    • These are created empty as per above using nx.Graph()
  • Directed graphs (e.g., Twitter) are typically drawn as a line with an arrow (uni- or bi-directional, depending on who follows whom) between two circles
    • These are created empty using nx.DiGraph()
  • Multi graphs (e.g., trips between bike stations) are typically drawn as many arrows between two circles
    • These are created empty using nx.MultiGraph()
  • Multi graphs can instead be represented with a single weighted edge per pair of nodes, where the weight represents the frequency of occurrence (this saves memory and simplifies plotting relative to storing every parallel edge)
    • The weight may just be included as part of the metadata dictionary
  • Self-loops are edges that connect a node to itself, such as bike trips that start and end at the same station
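
The weighted-edge idea can be sketched in plain Python, without networkx. The trip list and station names below are hypothetical; the point is only that parallel edges collapse into one edge per (start, end) pair with a count stored as its weight, and that a self-loop is simply an edge whose endpoints match.

```python
from collections import Counter

# Hypothetical bike trips as (start_station, end_station) pairs
trips = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "B"), ("C", "C")]

# Collapse parallel edges into one edge per pair, carrying the trip count
weights = Counter(trips)
weighted_edges = [(u, v, {"weight": w}) for (u, v), w in weights.items()]

# A self-loop is an edge whose two endpoints are the same node
self_loops = [(u, v) for (u, v) in weights if u == v]

print(weighted_edges)
print(self_loops)
```

The (u, v, dictionary) triples have the same shape that add_edges_from() accepts, so the collapsed edges could be loaded into a DiGraph directly.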

Network visualization - irrational (“looks like a hairball”) and rational visualizations:

  • Three primary types of plots - Matrix plot, Arc plot, Circos plot
  • The Matrix plot is a simple row-column, with the square filled in if the edge between the nodes exists
    • With an undirected Graph, the matrix will be symmetrical around the diagonal
    • With a directed Graph, the matrix need not be symmetrical around the diagonal (the columns are what the arrow hits, the rows are what it is from)
  • The Arc plot is a transformation where the nodes are all along a single axis of the plot, with connections drawn as semi-circles
  • The Circos plot is a transformation of the Arc plot, but where the “axis” is converted into a circle
  • The “nxviz” package, typically imported as “nv”, allows for visualizing Graphs using the above plot types
    • ap = nv.ArcPlot(G) ; ap.draw() ; plt.show() will create the ArcPlot
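
To make the symmetry point about Matrix plots concrete, here is a small plain-Python sketch (not how nxviz draws its plots). The helper names adjacency_matrix() and is_symmetric() are made up for this illustration; rows are the nodes an edge comes from and columns are the nodes it points to.

```python
def adjacency_matrix(nodes, edges, directed=False):
    """Return a 0/1 list-of-lists matrix; rows are sources, columns are targets."""
    index = {n: i for i, n in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        if not directed:
            # An undirected edge is recorded in both directions
            matrix[index[v]][index[u]] = 1
    return matrix


def is_symmetric(matrix):
    """Check whether the matrix equals its own transpose."""
    size = len(matrix)
    return all(matrix[i][j] == matrix[j][i]
               for i in range(size) for j in range(size))


nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]

undirected = adjacency_matrix(nodes, edges)
directed = adjacency_matrix(nodes, edges, directed=True)

print(is_symmetric(undirected))
print(is_symmetric(directed))
```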

Example code includes:


## NEED TO MOCK UP T_sub from the above
import networkx as nx
import datetime

T_sub = nx.DiGraph()

T_sub.add_nodes_from([1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])

T_sub.add_edges_from([(1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9), (1, 10), (1, 11), (1, 12), (1, 13), (1, 14), (1, 15), (1, 16), (1, 17), (1, 18), (1, 19), (1, 20), (1, 21), (1, 22), (1, 23), (1, 24), (1, 25), (1, 26), (1, 27), (1, 28), (1, 29), (1, 30), (1, 31), (1, 32), (1, 33), (1, 34), (1, 35), (1, 36), (1, 37), (1, 38), (1, 39), (1, 40), (1, 41), (1, 42), (1, 43), (1, 44), (1, 45), (1, 46), (1, 47), (1, 48), (1, 49), (16, 48), (16, 18), (16, 35), (16, 36), (18, 16), (18, 24), (18, 35), (18, 36), (19, 35), (19, 36), (19, 5), (19, 8), (19, 11), (19, 13), (19, 15), (19, 48), (19, 17), (19, 20), (19, 21), (19, 24), (19, 37), (19, 30), (19, 31), (28, 1), (28, 5), (28, 7), (28, 8), (28, 11), (28, 14), (28, 15), (28, 17), (28, 20), (28, 21), (28, 24), (28, 25), (28, 27), (28, 29), (28, 30), (28, 31), (28, 35), (28, 36), (28, 37), (28, 44), (28, 48), (28, 49), (36, 24), (36, 35), (36, 5), (36, 37), (37, 24), (37, 35), (37, 36), (39, 1), (39, 35), (39, 36), (39, 38), (39, 33), (39, 40), (39, 41), (39, 45), (39, 24), (42, 1), (43, 48), (43, 35), (43, 36), (43, 37), (43, 24), (43, 29), (43, 47), (45, 1), (45, 39), (45, 41)])

node_meta = [{'occupation': 'scientist', 'category': 'I'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'P'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 
'category': 'I'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'P'}]

for x in range(len(T_sub.nodes())) :
    T_sub.node[T_sub.nodes()[x]]["occupation"] = node_meta[x]["occupation"]
    T_sub.node[T_sub.nodes()[x]]["category"] = node_meta[x]["category"]

edge_meta = [{'date': datetime.date(2012, 11, 17)}, {'date': datetime.date(2007, 6, 19)}, {'date': datetime.date(2014, 3, 18)}, {'date': datetime.date(2007, 3, 18)}, {'date': datetime.date(2011, 12, 19)}, {'date': datetime.date(2013, 12, 7)}, {'date': datetime.date(2009, 11, 9)}, {'date': datetime.date(2008, 10, 7)}, {'date': datetime.date(2008, 8, 14)}, {'date': datetime.date(2011, 3, 22)}, {'date': datetime.date(2014, 8, 3)}, {'date': datetime.date(2007, 5, 19)}, {'date': datetime.date(2009, 12, 13)}, {'date': datetime.date(2011, 4, 7)}, {'date': datetime.date(2013, 8, 2)}, {'date': datetime.date(2014, 11, 17)}, {'date': datetime.date(2013, 5, 20)}, {'date': datetime.date(2010, 12, 15)}, {'date': datetime.date(2010, 11, 27)}, {'date': datetime.date(2013, 9, 5)}, {'date': datetime.date(2013, 3, 1)}, {'date': datetime.date(2007, 7, 8)}, {'date': datetime.date(2010, 5, 23)}, {'date': datetime.date(2007, 9, 14)}, {'date': datetime.date(2013, 1, 24)}, {'date': datetime.date(2013, 6, 21)}, {'date': datetime.date(2010, 6, 28)}, {'date': datetime.date(2011, 12, 2)}, {'date': datetime.date(2010, 7, 24)}, {'date': datetime.date(2010, 7, 4)}, {'date': datetime.date(2013, 9, 28)}, {'date': datetime.date(2007, 3, 17)}, {'date': datetime.date(2013, 11, 7)}, {'date': datetime.date(2012, 8, 13)}, {'date': datetime.date(2009, 2, 19)}, {'date': datetime.date(2007, 3, 17)}, {'date': datetime.date(2011, 11, 15)}, {'date': datetime.date(2011, 12, 26)}, {'date': datetime.date(2010, 2, 14)}, {'date': datetime.date(2014, 4, 16)}, {'date': datetime.date(2010, 2, 28)}, {'date': datetime.date(2007, 11, 2)}, {'date': datetime.date(2008, 5, 17)}, {'date': datetime.date(2013, 11, 18)}, {'date': datetime.date(2010, 11, 14)}, {'date': datetime.date(2007, 8, 19)}, {'date': datetime.date(2012, 5, 11)}, {'date': datetime.date(2007, 10, 27)}, {'date': datetime.date(2009, 11, 14)}, {'date': datetime.date(2009, 4, 19)}, {'date': datetime.date(2007, 7, 14)}, {'date': datetime.date(2012, 5, 7)}, 
{'date': datetime.date(2014, 5, 4)}, {'date': datetime.date(2012, 6, 16)}, {'date': datetime.date(2012, 4, 25)}, {'date': datetime.date(2012, 6, 25)}, {'date': datetime.date(2010, 10, 14)}, {'date': datetime.date(2013, 4, 18)}, {'date': datetime.date(2013, 10, 6)}, {'date': datetime.date(2009, 8, 2)}, {'date': datetime.date(2008, 9, 23)}, {'date': datetime.date(2011, 11, 26)}, {'date': datetime.date(2010, 1, 22)}, {'date': datetime.date(2012, 6, 23)}, {'date': datetime.date(2013, 11, 20)}, {'date': datetime.date(2008, 7, 6)}, {'date': datetime.date(2009, 4, 12)}, {'date': datetime.date(2011, 12, 28)}, {'date': datetime.date(2012, 1, 22)}, {'date': datetime.date(2009, 1, 26)}, {'date': datetime.date(2012, 1, 13)}, {'date': datetime.date(2010, 9, 26)}, {'date': datetime.date(2013, 11, 14)}, {'date': datetime.date(2010, 7, 22)}, {'date': datetime.date(2013, 3, 17)}, {'date': datetime.date(2008, 10, 18)}, {'date': datetime.date(2008, 12, 9)}, {'date': datetime.date(2012, 1, 14)}, {'date': datetime.date(2012, 6, 28)}, {'date': datetime.date(2011, 10, 5)}, {'date': datetime.date(2007, 5, 19)}, {'date': datetime.date(2013, 1, 24)}, {'date': datetime.date(2008, 6, 28)}, {'date': datetime.date(2008, 5, 16)}, {'date': datetime.date(2013, 5, 8)}, {'date': datetime.date(2007, 7, 23)}, {'date': datetime.date(2010, 8, 4)}, {'date': datetime.date(2011, 10, 18)}, {'date': datetime.date(2011, 6, 2)}, {'date': datetime.date(2009, 5, 23)}, {'date': datetime.date(2010, 10, 14)}, {'date': datetime.date(2013, 7, 17)}, {'date': datetime.date(2008, 5, 19)}, {'date': datetime.date(2008, 3, 19)}, {'date': datetime.date(2010, 8, 14)}, {'date': datetime.date(2012, 6, 19)}, {'date': datetime.date(2013, 8, 12)}, {'date': datetime.date(2013, 7, 6)}, {'date': datetime.date(2014, 10, 11)}, {'date': datetime.date(2012, 7, 1)}, {'date': datetime.date(2013, 11, 5)}, {'date': datetime.date(2009, 11, 6)}, {'date': datetime.date(2009, 4, 19)}, {'date': datetime.date(2008, 8, 12)}, {'date': 
datetime.date(2012, 8, 8)}, {'date': datetime.date(2009, 8, 12)}, {'date': datetime.date(2012, 5, 27)}, {'date': datetime.date(2011, 9, 15)}, {'date': datetime.date(2013, 12, 19)}, {'date': datetime.date(2007, 12, 7)}, {'date': datetime.date(2008, 3, 4)}, {'date': datetime.date(2013, 9, 16)}, {'date': datetime.date(2009, 11, 22)}, {'date': datetime.date(2014, 9, 19)}, {'date': datetime.date(2008, 10, 20)}, {'date': datetime.date(2010, 12, 16)}, {'date': datetime.date(2013, 3, 15)}, {'date': datetime.date(2012, 4, 25)}, {'date': datetime.date(2009, 5, 10)}]

for x in range(len(T_sub.edges())) :
    a, b = T_sub.edges()[x]
    T_sub.edge[a][b]["date"] = edge_meta[x]["date"]


# Import necessary modules
import matplotlib.pyplot as plt


# Draw the graph to screen
nx.draw(T_sub)
# plt.show()
plt.savefig("_dummyPy024.png", bbox_inches="tight")



# Also need to mock up T
# Use T_sub for these
# Use a list comprehension to get the nodes of interest: noi
noi = [n for n, d in T_sub.nodes(data=True) if d['occupation'] == 'scientist']

# Use a list comprehension to get the edges of interest: eoi
eoi = [(u, v) for u, v, d in T_sub.edges(data=True) if d["date"] < datetime.date(2010, 1, 1)]


# Set the weight of the edge
T_sub.edge[1][10]["weight"] = 2

# Iterate over all the edges (with metadata)
for u, v, d in T_sub.edges(data=True):
    
    # Check if node 293 is involved
    # Make it node 23 instead
    if 23 in [u, v]:
        # Set the weight to 1.1
        T_sub.edge[u][v]["weight"] = 1.1


# Define find_selfloop_nodes()
def find_selfloop_nodes(G):
    """
    Finds all nodes that have self-loops in the graph G.
    """
    nodes_in_selfloops = []
    
    # Iterate over all the edges of G
    for u, v in G.edges():
        # Check if node u and node v are the same
        if u == v:
            # Append node u to nodes_in_selfloops
            nodes_in_selfloops.append(u)
            
    return nodes_in_selfloops

# Check whether number of self loops equals the number of nodes in self loops
# The mock-up above has no self-loops, so this is just for reference on how to find them
assert T_sub.number_of_selfloops() == len(find_selfloop_nodes(T_sub))


# Import nxviz
import nxviz as nv

# Create the MatrixPlot object: m
m = nv.MatrixPlot(T_sub)

# Draw m to the screen
m.draw()

# Display the plot
# plt.show()
plt.savefig("_dummyPy025.png", bbox_inches="tight")


# Convert T_sub to a matrix format: A
A = nx.to_numpy_matrix(T_sub)

# Convert A back to the NetworkX form as a directed graph: T_conv
T_conv = nx.from_numpy_matrix(A, create_using=nx.DiGraph())

# Check that the `category` metadata field is lost from each node
for n, d in T_conv.nodes(data=True):
    assert 'category' not in d.keys()


# Import necessary modules
import matplotlib.pyplot as plt
from nxviz import CircosPlot

# Create the CircosPlot object: c
c = CircosPlot(T_sub)

# Draw c to the screen
c.draw()

# Display the plot
# plt.show()
plt.savefig("_dummyPy026.png", bbox_inches="tight")


# Import necessary modules
from nxviz import ArcPlot

# Create the un-customized ArcPlot object: a
a = ArcPlot(T_sub)

# Draw a to the screen
a.draw()

# Display the plot
# plt.show()
plt.savefig("_dummyPy027.png", bbox_inches="tight")


# Create the customized ArcPlot object: a2
a2 = ArcPlot(T_sub, node_order="category", node_color="category")

# Draw a2 to the screen
a2.draw()

# Display the plot
# plt.show()
plt.savefig("_dummyPy028.png", bbox_inches="tight")

Example network plot:

Example MatrixPlot (network):

Example CircosPlot (network):

Example ArcPlot (network):

Example ArcPlot (network) colored by category:


Chapter 2 - Important Nodes

Degree centrality - one method of determining important nodes:

  • Being connected to another node makes you a “neighbor” of that node
  • Degree centrality for a node is defined as “# neighbors I have” divided by “number of possible neighbors”
    • Depending on self-loops, the “number of possible neighbors” may or may not include itself
  • Examples of high degree centrality include Twitter broadcasters, airport hubs, disease super-spreaders, and the like
  • Within the “networkx” package, G.neighbors(1) will give a list of all the neighbors of node 1
    • Can instead run nx.degree_centrality(G) # outputs a dictionary of node:centrality; self-loops are not considered
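
The definition above can be written out by hand. This is a plain-Python sketch under the assumption that self-loops are excluded, so each node has n - 1 possible neighbors; the star-shaped adjacency dictionary is hypothetical, and the function is an illustration rather than the networkx implementation.

```python
def degree_centrality(adjacency):
    """Map each node to (number of neighbors) / (n - 1)."""
    n = len(adjacency)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


# Hypothetical star-shaped network: "hub" is connected to every other node
adjacency = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}

centrality = degree_centrality(adjacency)
print(centrality["hub"])
print(centrality["a"])
```

The hub reaches every possible neighbor, so it receives the maximum score, while each spoke reaches only one of its three possible neighbors.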

Graph algorithms - path finding for optimization (e.g., shortest path between nodes, information or disease spread, etc.):

  • The breadth-first search (BFS) algorithm was first developed in the 1950s for finding the shortest path out of a maze
  • Basically, take one of the points, then find its neighbors, then its neighbors’ neighbors, etc., until the second point is found
  • Use G.edges(), G.nodes(), and automate the search for finding paths between any two given points
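
The search described above can be sketched in plain Python. The adjacency dictionary is a hypothetical stand-in for what G.nodes() and G.edges() would provide, and shortest_path() is an illustrative helper, not a networkx function. Because BFS visits nodes in order of distance from the start, the first time it reaches the goal it has found a shortest path.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    previous = {start: None}   # also serves as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the recorded predecessors to rebuild the path
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nbr in adjacency[node]:
            if nbr not in previous:
                previous[nbr] = node
                queue.append(nbr)
    return None


# Hypothetical undirected network; node 6 is isolated
adjacency = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4], 6: []}

print(shortest_path(adjacency, 1, 5))
print(shortest_path(adjacency, 1, 6))
```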

Betweenness centrality - including the key concept of “all shortest paths”:

  • All shortest paths is based on finding all shortest paths between all pairs of nodes
  • Betweenness centrality is defined as “# shortest paths running THROUGH node” divided by “all possible shortest paths”
    • n.b. that paths between a node and its direct neighbor are not counted in the numerator or the denominator; only paths of length 2+ are relevant
  • This helps to identify “bottleneck” nodes - points that if eliminated would significantly slow or even stop connections
    • Can use nx.barbell_graph(m1=, m2=) # m1 will be the size of each bell, m2 will be the number of connector nodes (zero would just connect one point on each bell directly)
    • Can use nx.betweenness_centrality(G) to get a dictionary of node:betweenness
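
A brute-force sketch of the idea in plain Python, for small undirected graphs only. It assumes the common normalization in which each pair of other nodes contributes the fraction of its shortest paths that pass through the node, divided by the number of such pairs; the helper names are made up, and this is not the algorithm networkx uses. The example graph is a hand-built barbell: two triangles joined through a single connector node 3.

```python
from collections import deque
from itertools import combinations


def all_shortest_paths(adjacency, source, target):
    """Enumerate every shortest path from source to target using BFS."""
    distance = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in distance:
                distance[nbr] = distance[node] + 1
                parents[nbr] = [node]
                queue.append(nbr)
            elif distance[nbr] == distance[node] + 1:
                # Another equally short way of reaching nbr
                parents[nbr].append(node)
    if target not in distance:
        return []

    def unwind(node):
        if node == source:
            return [[source]]
        return [path + [node] for p in parents[node] for path in unwind(p)]

    return unwind(target)


def betweenness_centrality(adjacency):
    """Fraction of shortest paths through each node, averaged over node pairs."""
    nodes = list(adjacency)
    scores = {n: 0.0 for n in nodes}
    for s, t in combinations(nodes, 2):
        paths = all_shortest_paths(adjacency, s, t)
        for path in paths:
            for inner in path[1:-1]:   # endpoints do not count
                scores[inner] += 1 / len(paths)
    pairs = (len(nodes) - 1) * (len(nodes) - 2) / 2
    return {n: score / pairs for n, score in scores.items()}


# Two triangles (0-1-2 and 4-5-6) joined through connector node 3
barbell = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4],
    4: [3, 5, 6], 5: [4, 6], 6: [4, 5],
}

scores = betweenness_centrality(barbell)
print(scores[3])
print(scores[0])
```

The connector node sits on every path between the two triangles, so it scores highest, while the outer corner nodes lie on no shortest path between other nodes.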

Example code includes:


import networkx as nx
import matplotlib.pyplot as plt
import datetime


# DO NOT HAVE Graph T
# Make the same as above
T = nx.DiGraph()

T.add_nodes_from([1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])

T.add_edges_from([(1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9), (1, 10), (1, 11), (1, 12), (1, 13), (1, 14), (1, 15), (1, 16), (1, 17), (1, 18), (1, 19), (1, 20), (1, 21), (1, 22), (1, 23), (1, 24), (1, 25), (1, 26), (1, 27), (1, 28), (1, 29), (1, 30), (1, 31), (1, 32), (1, 33), (1, 34), (1, 35), (1, 36), (1, 37), (1, 38), (1, 39), (1, 40), (1, 41), (1, 42), (1, 43), (1, 44), (1, 45), (1, 46), (1, 47), (1, 48), (1, 49), (16, 48), (16, 18), (16, 35), (16, 36), (18, 16), (18, 24), (18, 35), (18, 36), (19, 35), (19, 36), (19, 5), (19, 8), (19, 11), (19, 13), (19, 15), (19, 48), (19, 17), (19, 20), (19, 21), (19, 24), (19, 37), (19, 30), (19, 31), (28, 1), (28, 5), (28, 7), (28, 8), (28, 11), (28, 14), (28, 15), (28, 17), (28, 20), (28, 21), (28, 24), (28, 25), (28, 27), (28, 29), (28, 30), (28, 31), (28, 35), (28, 36), (28, 37), (28, 44), (28, 48), (28, 49), (36, 24), (36, 35), (36, 5), (36, 37), (37, 24), (37, 35), (37, 36), (39, 1), (39, 35), (39, 36), (39, 38), (39, 33), (39, 40), (39, 41), (39, 45), (39, 24), (42, 1), (43, 48), (43, 35), (43, 36), (43, 37), (43, 24), (43, 29), (43, 47), (45, 1), (45, 39), (45, 41)])

node_meta = [{'occupation': 'scientist', 'category': 'I'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'P'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 
'category': 'I'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'P'}]

for x in range(len(T.nodes())) :
    T.node[T.nodes()[x]]["occupation"] = node_meta[x]["occupation"]
    T.node[T.nodes()[x]]["category"] = node_meta[x]["category"]

edge_meta = [{'date': datetime.date(2012, 11, 17)}, {'date': datetime.date(2007, 6, 19)}, {'date': datetime.date(2014, 3, 18)}, {'date': datetime.date(2007, 3, 18)}, {'date': datetime.date(2011, 12, 19)}, {'date': datetime.date(2013, 12, 7)}, {'date': datetime.date(2009, 11, 9)}, {'date': datetime.date(2008, 10, 7)}, {'date': datetime.date(2008, 8, 14)}, {'date': datetime.date(2011, 3, 22)}, {'date': datetime.date(2014, 8, 3)}, {'date': datetime.date(2007, 5, 19)}, {'date': datetime.date(2009, 12, 13)}, {'date': datetime.date(2011, 4, 7)}, {'date': datetime.date(2013, 8, 2)}, {'date': datetime.date(2014, 11, 17)}, {'date': datetime.date(2013, 5, 20)}, {'date': datetime.date(2010, 12, 15)}, {'date': datetime.date(2010, 11, 27)}, {'date': datetime.date(2013, 9, 5)}, {'date': datetime.date(2013, 3, 1)}, {'date': datetime.date(2007, 7, 8)}, {'date': datetime.date(2010, 5, 23)}, {'date': datetime.date(2007, 9, 14)}, {'date': datetime.date(2013, 1, 24)}, {'date': datetime.date(2013, 6, 21)}, {'date': datetime.date(2010, 6, 28)}, {'date': datetime.date(2011, 12, 2)}, {'date': datetime.date(2010, 7, 24)}, {'date': datetime.date(2010, 7, 4)}, {'date': datetime.date(2013, 9, 28)}, {'date': datetime.date(2007, 3, 17)}, {'date': datetime.date(2013, 11, 7)}, {'date': datetime.date(2012, 8, 13)}, {'date': datetime.date(2009, 2, 19)}, {'date': datetime.date(2007, 3, 17)}, {'date': datetime.date(2011, 11, 15)}, {'date': datetime.date(2011, 12, 26)}, {'date': datetime.date(2010, 2, 14)}, {'date': datetime.date(2014, 4, 16)}, {'date': datetime.date(2010, 2, 28)}, {'date': datetime.date(2007, 11, 2)}, {'date': datetime.date(2008, 5, 17)}, {'date': datetime.date(2013, 11, 18)}, {'date': datetime.date(2010, 11, 14)}, {'date': datetime.date(2007, 8, 19)}, {'date': datetime.date(2012, 5, 11)}, {'date': datetime.date(2007, 10, 27)}, {'date': datetime.date(2009, 11, 14)}, {'date': datetime.date(2009, 4, 19)}, {'date': datetime.date(2007, 7, 14)}, {'date': datetime.date(2012, 5, 7)}, 
{'date': datetime.date(2014, 5, 4)}, {'date': datetime.date(2012, 6, 16)}, {'date': datetime.date(2012, 4, 25)}, {'date': datetime.date(2012, 6, 25)}, {'date': datetime.date(2010, 10, 14)}, {'date': datetime.date(2013, 4, 18)}, {'date': datetime.date(2013, 10, 6)}, {'date': datetime.date(2009, 8, 2)}, {'date': datetime.date(2008, 9, 23)}, {'date': datetime.date(2011, 11, 26)}, {'date': datetime.date(2010, 1, 22)}, {'date': datetime.date(2012, 6, 23)}, {'date': datetime.date(2013, 11, 20)}, {'date': datetime.date(2008, 7, 6)}, {'date': datetime.date(2009, 4, 12)}, {'date': datetime.date(2011, 12, 28)}, {'date': datetime.date(2012, 1, 22)}, {'date': datetime.date(2009, 1, 26)}, {'date': datetime.date(2012, 1, 13)}, {'date': datetime.date(2010, 9, 26)}, {'date': datetime.date(2013, 11, 14)}, {'date': datetime.date(2010, 7, 22)}, {'date': datetime.date(2013, 3, 17)}, {'date': datetime.date(2008, 10, 18)}, {'date': datetime.date(2008, 12, 9)}, {'date': datetime.date(2012, 1, 14)}, {'date': datetime.date(2012, 6, 28)}, {'date': datetime.date(2011, 10, 5)}, {'date': datetime.date(2007, 5, 19)}, {'date': datetime.date(2013, 1, 24)}, {'date': datetime.date(2008, 6, 28)}, {'date': datetime.date(2008, 5, 16)}, {'date': datetime.date(2013, 5, 8)}, {'date': datetime.date(2007, 7, 23)}, {'date': datetime.date(2010, 8, 4)}, {'date': datetime.date(2011, 10, 18)}, {'date': datetime.date(2011, 6, 2)}, {'date': datetime.date(2009, 5, 23)}, {'date': datetime.date(2010, 10, 14)}, {'date': datetime.date(2013, 7, 17)}, {'date': datetime.date(2008, 5, 19)}, {'date': datetime.date(2008, 3, 19)}, {'date': datetime.date(2010, 8, 14)}, {'date': datetime.date(2012, 6, 19)}, {'date': datetime.date(2013, 8, 12)}, {'date': datetime.date(2013, 7, 6)}, {'date': datetime.date(2014, 10, 11)}, {'date': datetime.date(2012, 7, 1)}, {'date': datetime.date(2013, 11, 5)}, {'date': datetime.date(2009, 11, 6)}, {'date': datetime.date(2009, 4, 19)}, {'date': datetime.date(2008, 8, 12)}, {'date': 
datetime.date(2012, 8, 8)}, {'date': datetime.date(2009, 8, 12)}, {'date': datetime.date(2012, 5, 27)}, {'date': datetime.date(2011, 9, 15)}, {'date': datetime.date(2013, 12, 19)}, {'date': datetime.date(2007, 12, 7)}, {'date': datetime.date(2008, 3, 4)}, {'date': datetime.date(2013, 9, 16)}, {'date': datetime.date(2009, 11, 22)}, {'date': datetime.date(2014, 9, 19)}, {'date': datetime.date(2008, 10, 20)}, {'date': datetime.date(2010, 12, 16)}, {'date': datetime.date(2013, 3, 15)}, {'date': datetime.date(2012, 4, 25)}, {'date': datetime.date(2009, 5, 10)}]

for x in range(len(T.edges())) :
    a, b = T.edges()[x]
    T.edge[a][b]["date"] = edge_meta[x]["date"]



# Define nodes_with_m_nbrs()
def nodes_with_m_nbrs(G, m):
    """
    Returns all nodes in graph G that have m neighbors.
    """
    nodes = set()
    
    # Iterate over all nodes in G
    for n in G.nodes():
        # Check if the number of neighbors of n matches m
        if len(G.neighbors(n)) == m:
            # Add the node n to the set
            nodes.add(n)
    # Return the nodes with m neighbors
    return nodes

# Compute and print all nodes in T that have 3 neighbors
three_nbrs = nodes_with_m_nbrs(T, 3)
print(three_nbrs)


# Compute the degree of every node: degrees
degrees = [len(T.neighbors(n)) for n in T.nodes()]

# Print the degrees
print(degrees)


# Compute the degree centrality of the Twitter network: deg_cent
deg_cent = nx.degree_centrality(T)

# Plot a histogram of the degree centrality distribution of the graph.
plt.figure()
plt.hist(list(deg_cent.values()))
# plt.show()
plt.savefig("_dummyPy029.png", bbox_inches="tight")
plt.clf()

# Plot a histogram of the degree distribution of the graph
plt.figure()
plt.hist(degrees)
# plt.show()
plt.savefig("_dummyPy030.png", bbox_inches="tight")
plt.clf()

# Plot a scatter plot of the centrality distribution and the degree distribution
plt.figure()
plt.scatter(degrees, list(deg_cent.values()))
# plt.show()
plt.savefig("_dummyPy031.png", bbox_inches="tight")
plt.clf()


def path_exists(G, node1, node2):
    """
    This function checks whether a path exists between two nodes (node1, node2) in graph G.
    """
    visited_nodes = set()
    queue = [node1]
    
    for node in queue:
        # Materialize the neighbors so they can be checked and then reused
        neighbors = list(G.neighbors(node))
        if node2 in neighbors:
            print('Path exists between nodes {0} and {1}'.format(node1, node2))
            return True

        # Mark the node as visited and queue its unvisited neighbors
        visited_nodes.add(node)
        queue.extend([n for n in neighbors if n not in visited_nodes])

    # The queue is exhausted without reaching node2
    print('Path does not exist between nodes {0} and {1}'.format(node1, node2))
    return False


# Compute the betweenness centrality of T: bet_cen
bet_cen = nx.betweenness_centrality(T)

# Compute the degree centrality of T: deg_cen
deg_cen = nx.degree_centrality(T)

# Create a scatter plot of betweenness centrality and degree centrality
plt.scatter(list(bet_cen.values()), list(deg_cen.values()))

# Display the plot
# plt.show()
plt.savefig("_dummyPy032.png", bbox_inches="tight")
plt.clf()


# Define find_nodes_with_highest_deg_cent()
def find_nodes_with_highest_deg_cent(G):
    # Compute the degree centrality of G: deg_cent
    deg_cent = nx.degree_centrality(G)
    
    # Compute the maximum degree centrality: max_dc
    max_dc = max(list(deg_cent.values()))
    
    nodes = set()
    
    # Iterate over the degree centrality dictionary
    for k, v in deg_cent.items():
        # Check if the current value has the maximum degree centrality
        if v == max_dc:
            # Add the current node to the set of nodes
            nodes.add(k)
            
    return nodes
    
# Find the node(s) that has the highest degree centrality in T: top_dc
top_dc = find_nodes_with_highest_deg_cent(T)
print(top_dc)

# Write the assertion statement
for node in top_dc:
    assert nx.degree_centrality(T)[node] == max(nx.degree_centrality(T).values())


# Define find_node_with_highest_bet_cent()
def find_node_with_highest_bet_cent(G):
    # Compute betweenness centrality: bet_cent
    bet_cent = nx.betweenness_centrality(G)
    
    # Compute maximum betweenness centrality: max_bc
    max_bc = max(list(bet_cent.values()))
    
    nodes = set()
    
    # Iterate over the betweenness centrality dictionary
    for k, v in bet_cent.items():
        # Check if the current value has the maximum betweenness centrality
        if v == max_bc:
            # Add the current node to the set of nodes
            nodes.add(k)
            
    return nodes

# Use that function to find the node(s) that has the highest betweenness centrality in the network: top_bc
top_bc = find_node_with_highest_bet_cent(T)
print(top_bc)

# Write an assertion statement that checks that the node(s) is/are correctly identified.
for node in top_bc:
    assert nx.betweenness_centrality(T)[node] == max(nx.betweenness_centrality(T).values())
## {45, 37}
## [47, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 4, 15, 0, 0, 0, 0, 0, 0, 0, 0, 22, 0, 0, 0, 0, 0, 0, 0, 4, 3, 0, 9, 0, 0, 1, 7, 0, 3, 0, 0, 0, 0]
## {1}
## {1}

Histogram of degree centrality:

Histogram of degree distribution:

Scatter plot of degree centrality vs degree distribution:

Scatter plot of degree centrality vs betweenness centrality:


Chapter 3 - Structures

Cliques and communities - idea of tightly-knit groups:

  • In network theory, a “clique” is a set of nodes that are fully connected to each other (every pair of nodes in the set shares an edge)
  • Triangle closure is the idea that if A is connected to both B and C, but B and C are not connected to each other, then connecting B and C closes the triangle and forms a “clique”
  • A helpful package “itertools” has a function “combinations” that can help to iterate over many combinations (rather than a double for loop)
    • For example, combinations(“ABC”, 2) will create (“A”, “B”), (“A”, “C”), (“B”, “C”)
    • This will be an iterable, but it will not print by itself
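
A minimal sketch of the combinations() behavior described above (the “ABC” input comes from the notes; the variable names are only illustrative):

```python
from itertools import combinations

# combinations() returns a lazy iterator; printing it directly shows only the object
pairs = combinations("ABC", 2)

# Cast to a list to see the contents
pair_list = list(pairs)
print(pair_list)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]

# The double for loop that combinations() replaces, for comparison
letters = "ABC"
loop_pairs = []
for i in range(len(letters)):
    for j in range(i + 1, len(letters)):
        loop_pairs.append((letters[i], letters[j]))

print(loop_pairs == pair_list)  # True
```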

Maximal cliques - defined as a clique that when expanded by one node is no longer a clique:

  • Basically, there is no single extension (one extra node) that would make the clique larger
  • Communities are an expansion of the idea of cliques
  • One possible definition of “communities” would be maximal cliques that are of at least size x and that have at least y members in common
  • The find_cliques() function will find all of the maximal cliques in the network data
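
A small sketch of maximal cliques on a toy graph (the graph and the name G_toy are assumptions made up for illustration, not course data):

```python
import networkx as nx

# Hypothetical toy graph: a triangle (1, 2, 3) plus one extra edge (3, 4)
G_toy = nx.Graph()
G_toy.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4)])

# find_cliques() yields each maximal clique as a list of nodes
# Sorting is only there to make the printed order predictable
max_cliques = sorted(sorted(c) for c in nx.find_cliques(G_toy))
print(max_cliques)  # [[1, 2, 3], [3, 4]]
```

The edge (1, 2) is itself a clique, but it is not maximal since adding node 3 still leaves a clique, so it does not appear in the result.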

Sub-graphs - sometimes helpful to view just a small portion of a larger graph:

  • Can use commands such as Gnew = G.subgraph(myNodes) # will just contain the nodes of interest, as well as their edges to each other
  • Can then look at nx.draw(Gnew, with_labels=True) to request that labels be included on the visual
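
A minimal sketch of G.subgraph() on a toy graph (G_toy and its edges are made up for illustration; the drawing call is left out so that the sketch does not depend on matplotlib):

```python
import networkx as nx

# Hypothetical toy graph: a square 1-2-3-4 with one diagonal (1, 3)
G_toy = nx.Graph()
G_toy.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)])

# Keep only nodes 1, 2, 3; every edge touching node 4 is dropped
G_new = G_toy.subgraph([1, 2, 3])

sub_nodes = sorted(G_new.nodes())
sub_edges = sorted(tuple(sorted(e)) for e in G_new.edges())
print(sub_nodes)  # [1, 2, 3]
print(sub_edges)  # [(1, 2), (1, 3), (2, 3)]
```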

Example code includes:


from itertools import combinations

# Define is_in_triangle() 
def is_in_triangle(G, n):
    """
    Checks whether a node `n` in graph `G` is in a triangle relationship or not. 
    Returns a boolean.
    """
    in_triangle = False
    
    # Iterate over all possible triangle relationship combinations
    for n1, n2 in combinations(G.neighbors(n), 2):
        # Check if an edge exists between n1 and n2
        if G.has_edge(n1, n2):
            in_triangle = True
            break
    return in_triangle

# DO NOT HAVE T (make randomly, minus metadata)
import networkx as nx
import random
import numpy as np
import matplotlib.pyplot as plt

T = nx.Graph()
T.add_nodes_from([x for x in range(1, 31)])
np.random.seed(170530)
n1 = np.random.choice(range(1, 31), size=100, replace=True)
n2 = np.random.choice(range(1, 31), size=100, replace=True)

# Require that first be less than second
edge_list = [(min(x, y), max(x, y)) for x, y in zip(n1, n2) if x != y]
T.add_edges_from(edge_list)


# A set() keeps only unique items (it is unordered, not sorted); if a = {1, 2} and a.add(1) is run, then a will still be {1, 2}
# Can remove items from the set using a.remove() and can add items to the set using a.add()

# Write a function that identifies all nodes in a triangle relationship with a given node.
def nodes_in_triangle(G, n):
    """
    Returns the nodes in a graph `G` that are involved in a triangle relationship with the node `n`.
    """
    triangle_nodes = set([n])
    
    # Iterate over all possible triangle relationship combinations
    for n1, n2 in combinations(G.neighbors(n), 2):
        # Check if n1 and n2 have an edge between them
        if G.has_edge(n1, n2):
            # Add n1 to triangle_nodes
            triangle_nodes.add(n1)
            
            # Add n2 to triangle_nodes
            triangle_nodes.add(n2)
            
    return triangle_nodes
    
# Write the assertion statement
assert len(nodes_in_triangle(T, 1)) == 5  # happens to be what the RNG generated in this case


# Define node_in_open_triangle()
def node_in_open_triangle(G, n):
    """
    Checks whether pairs of neighbors of node `n` in graph `G` are in an 'open triangle' relationship with node `n`.
    """
    in_open_triangle = False
    
    # Iterate over all possible triangle relationship combinations
    for n1, n2 in combinations(G.neighbors(n), 2):
        # Check if n1 and n2 do NOT have an edge between them
        if not G.has_edge(n1, n2):
            in_open_triangle = True
            break
            
    return in_open_triangle

# Compute the number of open triangles in T
num_open_triangles = 0

# Iterate over all the nodes in T
for n in T.nodes():
    
    # Check if the current node is in an open triangle
    if node_in_open_triangle(T, n):
        
        # Increment num_open_triangles
        num_open_triangles += 1
    

print(num_open_triangles)


# Define maximal_cliques()
def maximal_cliques(G, size):
    """
    Finds all maximal cliques in graph `G` that are of size `size`.
    """
    mcs = []
    for clique in nx.find_cliques(G):
        if len(clique) == size:
            mcs.append(clique)
    return mcs

# Check the number of maximal cliques of size 3 in the graph T
assert len(maximal_cliques(T, 3)) == 26  # happens to be what the RNG returns in this case


# Define get_nodes_and_nbrs()
def get_nodes_and_nbrs(G, nodes_of_interest):
    """
    Returns a subgraph of the graph `G` with only the `nodes_of_interest` and their neighbors.
    """
    nodes_to_draw = []
    
    # Iterate over the nodes of interest
    for n in nodes_of_interest:
        # Append the nodes of interest to nodes_to_draw
        nodes_to_draw.append(n)
        
        # Iterate over all the neighbors of node n
        for nbr in G.neighbors(n):
            # Append the neighbors of n to nodes_to_draw
            nodes_to_draw.append(nbr)
        
    return G.subgraph(nodes_to_draw)


# Extract the subgraph with the nodes of interest: T_draw
nodes_of_interest = [8, 24, 26]
T_draw = get_nodes_and_nbrs(T, nodes_of_interest)

# Draw the subgraph to the screen
nx.draw(T_draw, with_labels=True)
# plt.show()
plt.savefig("_dummyPy033.png", bbox_inches="tight")


# Extract the nodes of interest: nodes
node_meta = [{'occupation': 'scientist', 'category': 'I'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'P'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'scientist', 'category': 'P'}]

for x in range(len(T.nodes())) :
    T.node[T.nodes()[x]]["occupation"] = node_meta[x]["occupation"]
    T.node[T.nodes()[x]]["category"] = node_meta[x]["category"]


nodes = [n for n, d in T.nodes(data=True) if d['occupation'] == 'celebrity']

# Create the set of nodes: nodeset
nodeset = set(nodes)

# Iterate over nodes
for n in nodeset:
    
    # Compute the neighbors of n: nbrs
    nbrs = T.neighbors(n)
    
    # Compute the union of nodeset and nbrs: nodeset
    nodeset = nodeset.union(nbrs)


# Compute the subgraph using nodeset: T_sub
T_sub = T.subgraph(nodeset)

# Draw T_sub to the screen
nx.draw(T_sub, with_labels=True)
# plt.show()
plt.savefig("_dummyPy034.png", bbox_inches="tight")
## 30

Example Sub-graph (anything touching any of [8, 24, 26]):

Example Sub-graph (specified “occupation” in metadata):


Chapter 4 - Case Study

Case study introduction - GitHub collaborator data:

  • The data will be a GitHub user collaboration network
  • The nodes will be the users and the edges will reflect collaboration on 1+ GitHub repositories
  • Goals include 1) analyze structure, 2) visualize graph, and 3) build simple recommendation systems

Case Study Part II - Visualization using the nxviz API:

  • circ = nv.CircosPlot(G) ; circ.draw() # Create the Circos plot
  • Additionally, will use the “connected component subgraph” features of networkx
  • A connected component subgraph is defined as a group of nodes connected to each other (perhaps not as a clique; may be through hubs) but with no connection to some other group of nodes
    • nx.connected_component_subgraphs(G) # forms a generator object; cast as list to read them
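
A sketch of the connected-component idea on a toy graph. It uses nx.connected_components() together with G.subgraph() rather than connected_component_subgraphs(), on the assumption that a newer networkx release may be installed (the helper appears to have been removed in later versions). G_toy is made up for illustration:

```python
import networkx as nx

# Hypothetical toy graph with three components: {1, 2, 3}, {4, 5}, and the isolated node 6
G_toy = nx.Graph()
G_toy.add_edges_from([(1, 2), (2, 3), (4, 5)])
G_toy.add_node(6)

# connected_components() yields one set of nodes per component
components = sorted(nx.connected_components(G_toy), key=len)
component_sizes = [len(c) for c in components]
print(component_sizes)  # [1, 2, 3]

# Largest connected component as a subgraph
largest_cc = G_toy.subgraph(components[-1])
print(sorted(largest_cc.nodes()))  # [1, 2, 3]
```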

Case Study Part III: Cliques:

  • Simplest clique is an edge
  • Simplest complex clique is a triangle
  • Maximal clique is a clique that cannot be extended just by adding one additional node
  • The nx.find_cliques(G) will find all of the maximal cliques in G

Case Study Part IV: Additional Tasks (building a recommender):

  • Find important users (share with the most other users - degree_centrality)
  • Find largest communities of collaborators (maximal cliques)
  • Build a collaboration recommendation system (open triangles)
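
A rough sketch of the open-triangle recommendation idea on a toy graph (the user names and G_toy are hypothetical). Each pair of unconnected users is scored by the number of collaborators they have in common:

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx

# Hypothetical collaboration graph: bob and cat both work with ann and dan, but not with each other
G_toy = nx.Graph()
G_toy.add_edges_from([
    ("ann", "bob"), ("ann", "cat"),
    ("dan", "bob"), ("dan", "cat"),
    ("ann", "dan"),
])

recommended = defaultdict(int)
for n in G_toy.nodes():
    # Sorting the neighbors keeps each pair in one consistent order, e.g. always ("bob", "cat")
    for n1, n2 in combinations(sorted(G_toy.neighbors(n)), 2):
        # An open triangle: n knows both n1 and n2, but n1 and n2 are not connected
        if not G_toy.has_edge(n1, n2):
            recommended[(n1, n2)] += 1

print(dict(recommended))  # {('bob', 'cat'): 2}
```

Sorting the neighbors is a design choice for this sketch: without it the same pair can be counted under two different keys, such as (1, 18) and (18, 1).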

Example code includes:


# Import necessary modules
import matplotlib.pyplot as plt
import networkx as nx 
import numpy as np
import random


# DO NOT HAVE Github collaborator graph "G"
# Dummy up the data - 20 each of 2 "flavors"
G = nx.Graph()
G.add_nodes_from([x for x in range(1, 41)])
np.random.seed(170531)

# Add edges for 1-20 with preference that they match to themselves
n1 = np.random.choice(range(1, 21), size=100, replace=True)
n2 = np.random.choice(range(1, 21), size=90, replace=True)
n3 = np.random.choice(range(21, 41), size=10, replace=True)

# Require that first be less than second
edge_list = [(min(x, y), max(x, y)) for x, y in zip(n1, np.append(n2, n3)) if x != y]
G.add_edges_from(edge_list)


# Add edges for 21-40 with preference that they match to themselves
n1 = np.random.choice(range(21, 41), size=50, replace=True)
n2 = np.random.choice(range(21, 41), size=40, replace=True)
n3 = np.random.choice(range(1, 21), size=10, replace=True)

# Require that first be less than second
edge_list = [(min(x, y), max(x, y)) for x, y in zip(n1, np.append(n2, n3)) if x != y]
G.add_edges_from(edge_list)

# Create two groupings for the nodes
node_meta = [{'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}]

for x in range(len(G.nodes())) :
    G.node[G.nodes()[x]]["grouping"] = node_meta[x]["grouping"]



# Plot the degree centrality distribution of the GitHub collaboration network
plt.hist(list(nx.degree_centrality(G).values()))
# plt.show()
plt.savefig("_dummyPy035.png", bbox_inches="tight")
plt.clf()



# Plot the betweenness centrality distribution of the GitHub collaboration network
plt.hist(list(nx.betweenness_centrality(G).values()))
# plt.show()
plt.savefig("_dummyPy036.png", bbox_inches="tight")
plt.clf()


# Import necessary modules
from nxviz import MatrixPlot


# Calculate the largest connected component subgraph: largest_ccs
largest_ccs = sorted(nx.connected_component_subgraphs(G), key=lambda x: len(x))[-1]

# Create the customized MatrixPlot object: h
h = MatrixPlot(largest_ccs, node_grouping="grouping")

# Draw the MatrixPlot to the screen
h.draw()
# plt.show()
plt.savefig("_dummyPy037.png", bbox_inches="tight")


# Import necessary modules
from nxviz.plots import ArcPlot


# Iterate over all the nodes in G, including the metadata
for n, d in G.nodes(data=True):
    
    # Calculate the degree of each node: G.node[n]['degree']
    G.node[n]['degree'] = nx.degree(G, n)
    
# Create the ArcPlot object: a
a = ArcPlot(G, node_order="degree")

# Draw the ArcPlot to the screen
a.draw()
# plt.show()
plt.savefig("_dummyPy038.png", bbox_inches="tight")


# Import necessary modules
from nxviz import CircosPlot
 
 
# Iterate over all the nodes, including the metadata
for n, d in G.nodes(data=True):
    
    # Calculate the degree of each node: G.node[n]['degree']
    G.node[n]['degree'] = nx.degree(G, n)

# Create the CircosPlot object: c
c = CircosPlot(G, node_order="degree", node_grouping="grouping", node_color="grouping")

# Draw the CircosPlot object to the screen
c.draw()
# plt.show()
plt.savefig("_dummyPy039.png", bbox_inches="tight")


# Calculate the maximal cliques in G: cliques
cliques = nx.find_cliques(G)

# Count and print the number of maximal cliques in G
print(len(list(cliques)))


# Find the author(s) that are part of the largest maximal clique: largest_clique
largest_clique = sorted(nx.find_cliques(G), key=lambda x:len(x))[-1]

# Create the subgraph of the largest_clique: G_lc
G_lc = G.subgraph(largest_clique)

# Create the CircosPlot object: c
c = CircosPlot(G_lc)

# Draw the CircosPlot to the screen
c.draw()
# plt.show()
plt.savefig("_dummyPy040.png", bbox_inches="tight")


# Compute the degree centralities of G: deg_cent
deg_cent = nx.degree_centrality(G)

# Compute the maximum degree centrality: max_dc
max_dc = max(deg_cent.values())

# Find the user(s) that have collaborated the most: prolific_collaborators
prolific_collaborators = [n for n, dc in deg_cent.items() if dc == max_dc]

# Print the most prolific collaborator(s)
print(prolific_collaborators)


# Identify the largest maximal clique: largest_max_clique
largest_max_clique = set(sorted(nx.find_cliques(G), key=lambda x: len(x))[-1])

# Create a subgraph from the largest_max_clique: G_lmc
G_lmc = G.subgraph(largest_max_clique)

# Go out 1 degree of separation
for node in G_lmc.nodes():
    G_lmc.add_nodes_from(G.neighbors(node))
    G_lmc.add_edges_from(zip([node]*len(G.neighbors(node)), G.neighbors(node)))

# Record each node's degree centrality score
for n in G_lmc.nodes():
    G_lmc.node[n]['degree centrality'] = nx.degree_centrality(G_lmc)[n]
        
# Create the ArcPlot object: a
a = ArcPlot(G_lmc, node_order = "degree centrality")

# Draw the ArcPlot to the screen
a.draw()
# plt.show()
plt.savefig("_dummyPy041.png", bbox_inches="tight")


# Import necessary modules
from itertools import combinations
from collections import defaultdict

# Initialize the defaultdict: recommended
recommended = defaultdict(int)

# Iterate over all the nodes in G
for n, d in G.nodes(data=True):
    
    # Iterate over all possible triangle relationship combinations
    for n1, n2 in combinations(G.neighbors(n), 2):
        
        # Check whether n1 and n2 do not have an edge
        if not G.has_edge(n1, n2):
            
            # Increment recommended
            recommended[(n1, n2)] += 1


# Identify the top 10 pairs of users
all_counts = sorted(recommended.values())
top10_pairs = [pair for pair, count in recommended.items() if count > all_counts[-10]]
print(top10_pairs)
## 75
## [6]
## [(3, 5), (6, 8), (18, 1), (6, 2)]

Case study - degree distribution:

Case study - betweenness centrality:

Case study - MatrixPlot:

Case study - ArcPlot:

Case study - CircosPlot:

Case Study - CircosPlot (for largest clique):

Case Study - ArcPlot (ordered by degree centrality):

Network Analysis in Python (Part II)

Chapter 1 - Bipartite Graphs and Recommendation Systems

Definitions and basic recap:

  • Network/Graph is comprised of two sets - nodes and edges (connections between nodes)
  • Undirected graphs (e.g., Facebook) have connections without any direction (“A connects B” and “B connects A” both mean the same thing)
  • Directed graphs (e.g., Twitter) have connections with direction (“A connects B” can happen even in the absence of “B connects A”)
  • The “networkx” package has a library of functions and API for working with networks and graphs
    • Typically, this is called using “import networkx as nx”
    • For a given graph G, can call G.nodes() and G.edges() to get lists of the relevant data
    • Running type(G) can return Graph, DiGraph, MultiGraph, and MultiDiGraph
  • The “nxviz” package has a simple API for visualization of networks in a rational manner
    • Typically, this is called using “import nxviz as nv”
    • Matrix, Arc, Circos, Hive - CircosPlot will be the most used in this course
    • c = nv.CircosPlot(G)
    • c.draw()
    • plt.show()

Bipartite graphs - graphs that are partitioned into two sets, with nodes that are only connected to nodes in the other partition:

  • Example might be customer nodes and product nodes, where the only connections will be customer/product
  • Typically, bipartite data will be encoded in the nodes using the “bipartite” keyword
    • G.add_nodes_from(myListCustomers, bipartite=“customers”)
    • G.add_nodes_from(myListProducts, bipartite=“products”)
    • G.nodes(data=True) # will show the nodes, including their attribute “bipartite”
  • “Degree centrality” is defined as # Neighbors / # Possible Neighbors
    • For a bipartite graph, the “Possible Neighbors” are only the nodes in the OTHER partition
    • Typically, list comprehensions are used to gather just the nodes that have one “bipartite” value or the other
    • Can then use this list of nodes from one partition with “nx.bipartite.degree_centrality(G, cust_nodes)”, where cust_nodes is the list of nodes in ONE of the partitions
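
A minimal sketch contrasting bipartite and ordinary degree centrality (the graph B and its node names are made up for illustration):

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical bipartite graph: 3 customers and 2 products
B = nx.Graph()
B.add_nodes_from(["c1", "c2", "c3"], bipartite="customers")
B.add_nodes_from(["p1", "p2"], bipartite="products")
B.add_edges_from([("c1", "p1"), ("c1", "p2"), ("c2", "p1")])

# List comprehension to gather the nodes of one partition
cust_nodes = [n for n, d in B.nodes(data=True) if d["bipartite"] == "customers"]

# Bipartite version: denominator is the size of the OTHER partition
bp_dc = bipartite.degree_centrality(B, cust_nodes)

# Ordinary version: denominator is every other node in the graph
plain_dc = nx.degree_centrality(B)

print(bp_dc["c1"])     # 1.0 (2 neighbors out of 2 products)
print(plain_dc["c1"])  # 0.5 (2 neighbors out of 4 other nodes)
```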

Bipartite graphs and recommendation systems - extending from unipartite (users-only) to bipartite (repo-users):

  • This is a means of recommending repositories for users to work on, based on set overlap
  • Often helpful to use node sets in code for these
    • The command G.neighbors(“myNode”) returns a list of all the neighbors of “myNode”
    • With two lists of nodes, could run set(nodeNeighbor1).intersection(nodeNeighbor3)
    • Alternately, set(nodeNeighbor3).difference(nodeNeighbor1) will return items in nodeNeighbor3 but not in nodeNeighbor1
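
A quick sketch of the two set operations, using hypothetical neighbor lists in place of real G.neighbors() output:

```python
# Hypothetical neighbor lists: the repositories each user has contributed to
nbrs_user1 = ["repoA", "repoB", "repoC"]
nbrs_user3 = ["repoB", "repoC", "repoD"]

# Repositories the two users have in common
shared = set(nbrs_user1).intersection(nbrs_user3)

# Repositories user3 works on that user1 does not (candidates to recommend to user1)
only_user3 = set(nbrs_user3).difference(nbrs_user1)

print(sorted(shared))      # ['repoB', 'repoC']
print(sorted(only_user3))  # ['repoD']
```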

Example code includes:


# Import necessary modules
import networkx as nx
from nxviz import CircosPlot
import matplotlib.pyplot as plt


# Create a dummy bipartite graph
bpG = nx.Graph()

# Create 200 nodes of projects and 1000 nodes of users
bpProj = list(range(200))
bpUser = list(range(200, 1200))
bpG.add_nodes_from(bpProj, bipartite="projects")
bpG.add_nodes_from(bpUser, bipartite="users")


# Create roughly 10,000 edges (out of 200,000 possible)
# Make some projects much more prolific than others
# Project likelihood = 0.5 for [0-3], 0.25 for [4-9], and (210 - Proj) / 3000 for all others
import random

for a in bpProj:
    for b in bpUser:
        # Set up the core likelihood based on project (a)
        if a <= 3: baseLike = 0.5
        elif a <= 9: baseLike = 0.25
        else: baseLike = (210-a) / 3000
        
        # Scale the likelihood by user
        finalLike = baseLike * (b + 300) / 1000
        if finalLike > random.random():
            bpG.add_edge(a, b)


# Add the degree centrality score of each node to their metadata dictionary
G = bpG.copy()
dcs = nx.degree_centrality(G)
for n in G.nodes():
    G.node[n]['centrality'] = dcs[n]

# Create the CircosPlot object: c
c = CircosPlot(G, node_color="bipartite", node_grouping="bipartite", node_order="centrality")

# Draw c to the screen (too time-consuming - large graphic)
# c.draw()

# Display the plot
# plt.show()


# Define get_nodes_from_partition()
def get_nodes_from_partition(G, partition):
    # Initialize an empty list for nodes to be returned
    nodes = []
    # Iterate over each node in the graph G
    for n in G.nodes():
        # Check that the node belongs to the particular partition
        if G.node[n]['bipartite'] == partition:
            # If so, append it to the list of nodes
            nodes.append(n)
    return nodes

# Print the number of nodes in the 'projects' partition
print(len(get_nodes_from_partition(G, "projects")))

# Print the number of nodes in the 'users' partition
print(len(get_nodes_from_partition(G, "users")))


# Import matplotlib
import matplotlib.pyplot as plt

# Get the 'users' nodes: user_nodes
user_nodes = get_nodes_from_partition(G, "users")

# Compute the degree centralities: dcs
dcs = nx.degree_centrality(G)

# Get the degree centralities for user_nodes: user_dcs
user_dcs = [dcs[n] for n in user_nodes]

# Plot the degree centrality distribution of user_dcs
# plt.yscale('log')
plt.hist(user_dcs, bins=20)
# plt.show()
plt.savefig("_dummyPy077.png", bbox_inches="tight")
plt.clf()


# Get the 'projects' nodes: project_nodes
project_nodes = [n for n, d in G.nodes(data=True) if d["bipartite"] == "projects"]

# Compute the degree centralities: dcs
dcs = nx.degree_centrality(G)

# Get the degree centralities for project_nodes: project_dcs
project_dcs = [dcs[n] for n in project_nodes]

# Plot the degree distribution of project_dcs
# plt.yscale('log')
plt.hist(project_dcs, bins=20)
# plt.show()
plt.savefig("_dummyPy078.png", bbox_inches="tight")
plt.clf()


def shared_partition_nodes(G, node1, node2):
    # Check that the nodes belong to the same partition
    assert G.node[node1]['bipartite'] == G.node[node2]['bipartite']
    
    # Get neighbors of node 1: nbrs1
    nbrs1 = G.neighbors(node1)
    # Get neighbors of node 2: nbrs2
    nbrs2 = G.neighbors(node2)
    
    # Compute the overlap using set intersections
    overlap = set(nbrs1).intersection(nbrs2)
    return overlap

# Print the number of shared projects between users 1050 and 1100
print(len(shared_partition_nodes(G, 1050, 1100)))


def user_similarity(G, user1, user2, proj_nodes):
    # Check that the nodes belong to the 'users' partition
    assert G.node[user1]['bipartite'] == 'users'
    assert G.node[user2]['bipartite'] == 'users'
    
    # Get the set of nodes shared between the two users
    shared_nodes = shared_partition_nodes(G, user1, user2)
    
    # Return the fraction of nodes in the projects partition
    return len(shared_nodes) / len(proj_nodes)

# Compute the similarity score between users 1050 and 1100
project_nodes = get_nodes_from_partition(G, "projects")
similarity_score = user_similarity(G, 1050, 1100, project_nodes)

print(similarity_score)


from collections import defaultdict

def most_similar_users(G, user, user_nodes, proj_nodes):
    # Data checks
    assert G.node[user]['bipartite'] == 'users'
    
    # Get other nodes from user partition
    user_nodes = set(user_nodes)
    user_nodes.remove(user)
    
    # Create the dictionary: similarities
    similarities = defaultdict(list)
    for n in user_nodes:
        similarity = user_similarity(G, user, n, proj_nodes)
        similarities[similarity].append(n)
    
    # Compute maximum similarity score: max_similarity
    max_similarity = max(similarities.keys())
    
    # Return list of users that share maximal similarity
    return similarities[max_similarity]

user_nodes = get_nodes_from_partition(G, 'users')
project_nodes = get_nodes_from_partition(G, 'projects')

print(most_similar_users(G, 1050, user_nodes, project_nodes))


def recommend_repositories(G, from_user, to_user):
    # Get the set of repositories that from_user has contributed to
    from_repos = set(G.neighbors(from_user))
    # Get the set of repositories that to_user has contributed to
    to_repos = set(G.neighbors(to_user))
    
    # Identify repositories that the from_user is connected to that the to_user is not connected to
    return from_repos.difference(to_repos)

# Print the repositories to be recommended
print(recommend_repositories(G, 1050, 1100))
## 200
## 1000
## 2
## 0.01
## [960]
## {0, 33, 32, 101, 5, 41, 50, 116, 23, 59}

User Degree Centrality:

Project Degree Centrality:

Chapter 2 - Graph projections

Concept of projection - investigating relationships between nodes in one partition CONDITIONAL on connections to nodes in the other partition:

  • An example would be that if customer1 and customer2 both bought product3, then customer1 and customer2 could be considered related by means of product3
  • Graphs on Disk - flat edge lists ; CSV files (nodelist + metadata , edgelist + metadata)
    • Text file is space-delimited containing Person Club {metadata}
    • A text file like the above can be read in directly using nx.read_edgelist(“myTextFile.txt”)
  • Can create a projection provided that one of the node types has already been created as a list
    • G_cust = nx.bipartite.projected_graph(G, cust_nodes)
    • G_cust will have nodes as per cust_nodes and edges as per all common relationships found to the other partition
  • Can generate degree centrality in any of three manners
    • nx.bipartite.degree_centrality(G, cust_nodes) # denominator is only the OTHER partition
    • nx.degree_centrality(G) # denominator is everyone other than myself
    • nx.degree_centrality(G_cust) # denominator is everyone else in MY partition
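
The three degree centrality variants can be compared on a toy graph. This is a minimal sketch, assuming networkx is installed; the customer and product names (c1, p1, and so on) are made up for illustration and are not part of the course data.

```python
import networkx as nx

# Hypothetical toy data: three customers and two products
cust_nodes = ["c1", "c2", "c3"]
prod_nodes = ["p1", "p2"]

G = nx.Graph()
G.add_nodes_from(cust_nodes, bipartite="customers")
G.add_nodes_from(prod_nodes, bipartite="products")
G.add_edges_from([("c1", "p1"), ("c2", "p1"), ("c2", "p2"), ("c3", "p2")])

# Project onto the customers: two customers are linked if they share a product
G_cust = nx.bipartite.projected_graph(G, cust_nodes)
print(sorted(G_cust.nodes()))

# Same nodes, three different denominators
dc_bipartite = nx.bipartite.degree_centrality(G, cust_nodes)  # other partition
dc_full = nx.degree_centrality(G)                             # all other nodes
dc_proj = nx.degree_centrality(G_cust)                        # other customers
for n in cust_nodes:
    print(n, dc_bipartite[n], dc_full[n], dc_proj[n])
```

Here c2 bought both products, so its bipartite degree centrality is 2/2, while its centrality on the full graph is 2/4 because the denominator counts all four other nodes.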

Bipartite graphs as matrices - modifications to the MatrixPlot methodology for bipartite graphs:

  • Rows - nodes in one partition
  • Columns - nodes in other partition
  • Cells - 1 if edge exists, 0 if not
  • Can be generated using mat = nx.bipartite.biadjacency_matrix(G, row_order=cust_nodes, column_order=prod_nodes)
    • The row_order argument is required and fixes which nodes form the rows (and in what order); column_order is optional, passed if there is reason to want the columns sorted in a particular manner
    • The matrix returned is of type Sparse Matrix (memory-efficient matrices)
  • The projection can be calculated as mat %*% t(mat) in R notation, where mat is per above
    • The diagonals show, for each node, its number of connections to the other partition (its degree in the bipartite graph)
    • The off-diagonals show, for each pair of nodes, the number of neighbors they share in the other partition
    • Can also run t(mat) %*% mat for the results based on the other of the two partitions
  • Matrix multiplication in Python is run using the “@” operator, so the projection is mat @ mat.T
    • mat.T is the transpose of mat in Python
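
The matrix form of the projection can be checked by hand on a tiny example. This sketch assumes a dense NumPy array stands in for the SciPy sparse matrix that biadjacency_matrix returns; the three-customer, two-product layout is hypothetical.

```python
import numpy as np

# Rows are customers (c1, c2, c3); columns are products (p1, p2)
# A 1 means that customer bought that product
mat = np.array([[1, 0],
                [1, 1],
                [0, 1]])

# Customer-customer projection
cust_proj = mat @ mat.T
print(cust_proj)

# Diagonal: number of products each customer bought
print(cust_proj.diagonal())

# Product-product projection (the other partition)
prod_proj = mat.T @ mat
print(prod_proj)
```

The entry for (c1, c3) is 0 because those two customers share no product, while the diagonal entry for c2 is 2 because c2 bought both products.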

Representing network data with pandas - human-readable graphs with CSV:

  • CSV files have many advantages - human-readable, easy for pandas to work with
    • Primary disadvantage is disk space, as there is repetition that might be avoided in the binary coding for a network/graph
  • Node list - each row is one node, with each column being some of the metadata associated to that node
  • Edge list - each row is one edge, with each column being some of the metadata associated to that edge
  • For pandas, take the data in G.nodes(data=True) and create a dictionary for each node, with the dictionaries being in a list
    • Can use the command dictA.update(dictB) and it will add the key-value pairs of dictB to dictA, overwriting any keys already present (seems like “append” for dictionaries)
    • Can then run pd.DataFrame(nodeList) and a data frame will be created
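
The record-building pattern can be sketched without a graph at all. The example below assumes pandas is installed and uses a hypothetical list of (node, metadata) pairs shaped like the output of G.nodes(data=True).

```python
import pandas as pd

# Hypothetical (node, metadata) pairs, shaped like G.nodes(data=True)
nodes_with_data = [
    ("Adams.John", {"bipartite": "people"}),
    ("Adams.Samuel", {"bipartite": "people"}),
]

nodelist = []
for n, d in nodes_with_data:
    # Start each record with the node name, then merge in its metadata
    nodeinfo = {"person": n}
    nodeinfo.update(d)
    nodelist.append(nodeinfo)

# One row per dictionary, one column per key
node_df = pd.DataFrame(nodelist)
print(node_df)

# update() replaces the value of any key that already exists
dictA = {"x": 1, "y": 2}
dictA.update({"y": 20, "z": 30})
print(dictA)
```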

Example code includes:


myPath = "./PythonInputFiles/"



import pandas as pd



# Import networkx
import networkx as nx

# Read in the data: g (create from the CSV instead)
# G = nx.read_edgelist('american-revolution.edgelist')

# downloaded file from DataCamp
revMap = pd.read_csv(myPath + "revolution_network_flatfile.txt", index_col=0)
revPeople = list(revMap.index)
revClubs = list(revMap.columns)

G = nx.Graph()
G.add_nodes_from(revPeople + revClubs)

for m in range(revMap.shape[0]):
    for n in range(revMap.shape[1]):
        if revMap.iloc[m, n] == 1: G.add_edge(revPeople[m], revClubs[n])

# Assign nodes to 'clubs' or 'people' partitions
# (d is the node's own attribute dictionary, so writing to it updates the graph)
for n, d in G.nodes(data=True):
    if '.' in n:
        d["bipartite"] = 'people'
    else:
        d["bipartite"] = 'clubs'

# Print the edges of the graph (too many - skip)
# print(G.edges())


# Prepare the nodelists needed for computing projections: people, clubs
people = [n for n, d in G.nodes(data=True) if d["bipartite"] == 'people']
clubs = [n for n, d in G.nodes(data=True) if d['bipartite'] == 'clubs']

# Compute the people and clubs projections: peopleG, clubsG
peopleG = nx.bipartite.projected_graph(G, people)
clubsG = nx.bipartite.projected_graph(G, clubs)


import matplotlib.pyplot as plt 

# Plot the degree centrality distribution of both node partitions from the original graph
plt.figure()
original_dc = nx.bipartite.degree_centrality(G, people)
plt.hist(list(original_dc.values()), alpha=0.5)
plt.yscale('log')
plt.title('Bipartite degree centrality')
# plt.show()
plt.savefig("_dummyPy079.png", bbox_inches="tight")
plt.clf()


# Plot the degree centrality distribution of the peopleG graph
plt.figure()  
people_dc = nx.degree_centrality(peopleG)
plt.hist(list(people_dc.values()))
plt.yscale('log')
plt.title('Degree centrality of people partition')
# plt.show()
plt.savefig("_dummyPy080.png", bbox_inches="tight")
plt.clf()


# Plot the degree centrality distribution of the clubsG graph
plt.figure() 
clubs_dc = nx.degree_centrality(clubsG)
plt.hist(list(clubs_dc.values()))
plt.yscale('log')
plt.title('Degree centrality of clubsG partition')
# plt.show()
plt.savefig("_dummyPy081.png", bbox_inches="tight")
plt.clf()


# Copied function from above
# Define get_nodes_from_partition()
def get_nodes_from_partition(G, partition):
    # Initialize an empty list for nodes to be returned
    nodes = []
    # Iterate over each node (and its attribute dictionary) in the graph G
    for n, d in G.nodes(data=True):
        # Check that the node belongs to the particular partition
        if d['bipartite'] == partition:
            # If so, append it to the list of nodes
            nodes.append(n)
    return nodes

# Get the list of people and list of clubs from the graph: people_nodes, clubs_nodes
people_nodes = get_nodes_from_partition(G, "people")
clubs_nodes = get_nodes_from_partition(G, "clubs")


# Requires scipy (the result is a scipy.sparse matrix)
# Compute the biadjacency matrix: bi_matrix
# bi_matrix = nx.bipartite.biadjacency_matrix(G, row_order=people_nodes, column_order=clubs_nodes)

# Compute the user-user projection: user_matrix
# user_matrix = bi_matrix @ bi_matrix.T

# print(user_matrix)


import numpy as np

# Find out the names of people who were members of the most number of clubs
# diag = user_matrix.diagonal() 
# indices = np.where(diag == diag.max())[0]  
# print('Number of clubs: {0}'.format(diag.max()))
# print('People with the most number of memberships:')
# for i in indices:
#     print('- {0}'.format(people_nodes[i]))

# Set the diagonal to zero and convert it to a coordinate matrix format
# user_matrix.setdiag(0)
# users_coo = user_matrix.tocoo()

# Find pairs of users who shared membership in the most number of clubs
# indices = np.where(users_coo.data == users_coo.data.max())[0]
# print('People with most number of shared memberships:')
# for idx in indices:
#     print('- {0}, {1}'.format(people_nodes[users_coo.row[idx]], people_nodes[users_coo.col[idx]]))  


# Initialize a list to store each node as a record: nodelist
nodelist = []
for n, d in peopleG.nodes(data=True):
    # nodeinfo stores one "record" of data as a dict
    nodeinfo = {'person': n} 
    
    # Update the nodeinfo dictionary 
    nodeinfo.update(d)
    
    # Append the nodeinfo to the node list
    nodelist.append(nodeinfo)
    

# Create a pandas DataFrame of the nodelist: node_df
node_df = pd.DataFrame(nodelist)
print(node_df.head())


# Initialize a list to store each edge as a record: edgelist
edgelist = []
for n1, n2, d in peopleG.edges(data=True):
    # Initialize a dictionary that shows edge information: edgeinfo
    edgeinfo = {'node1':n1, 'node2':n2}
    
    # Update the edgeinfo data with the edge metadata
    edgeinfo.update(d)
    
    # Append the edgeinfo to the edgelist
    edgelist.append(edgeinfo)

# Create a pandas DataFrame of the edgelist: edge_df
edge_df = pd.DataFrame(edgelist)
print(edge_df.head())
##   bipartite              person
## 0    people          Adams.John
## 1    people        Adams.Samuel
## 2    people            Allen.Dr
## 3    people  Appleton.Nathaniel
## 4    people         Ash.Gilbert
##         node1             node2
## 0  Adams.John      Story.Elisha
## 1  Adams.John        Bass.Henry
## 2  Adams.John     Edes.Benjamin
## 3  Adams.John    Champney.Caleb
## 4  Adams.John  Holmes.Nathaniel

Revolutionary War Connectivity (Bipartite):

Revolutionary War Connectivity (People):

Revolutionary War Connectivity (Clubs):


Chapter 3 - Comparing Graphs and Time-Dynamics

Introduction to graph differences - time series analysis (graph statistic over time):

  • Evolving graphs - example is a communication network, which dynamically changes over time
    • Could be constant nodes with evolving edges - easier to analyze
    • Could be evolving nodes with evolving edges
  • Frequently, the simplifying assumption is made of constant nodes with evolving edges - so, “differences” means “change in edge sets”
    • set(a).difference(set(b)) # will identify any items in a that are not in b
    • Can also run nx.difference(G1, G2), which assumes constant nodes but evolving edges (returns a graph containing the edges that are in G1 but not in G2)
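
Both forms of "difference" can be sketched on two small snapshots. This example assumes networkx is installed; the node names n1 to n4 are hypothetical, and both graphs are deliberately given the same node set, as nx.difference requires.

```python
import networkx as nx

# Plain set difference: items in the first set that are not in the second
only_in_a = set(["a", "b", "c"]).difference(set(["b", "c", "d"]))
print(only_in_a)

# Two snapshots over the same nodes, with different edges
nodes = ["n1", "n2", "n3", "n4"]

G1 = nx.Graph()
G1.add_nodes_from(nodes)
G1.add_edges_from([("n1", "n2"), ("n2", "n3")])

G2 = nx.Graph()
G2.add_nodes_from(nodes)
G2.add_edges_from([("n2", "n3"), ("n3", "n4")])

# Edges present in the later snapshot only, and in the earlier snapshot only
added = nx.difference(G2, G1)
removed = nx.difference(G1, G2)
print(list(added.edges()), list(removed.edges()))
```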

Evolving graph statistics - essentially, change in summary statistics (edges, degree centrality, degree distribution, activity spikes, etc.) over time:

  • For simple metrics, can use the edgelist data
  • For graph theoretic metrics, use the graph object
  • The ECDF (empirical cumulative distribution function - normed and cumulative) is a compact way to represent many time periods in a single plot

Zooming in and zooming out - overall graph summary vs. local graph summary:

  • Zooming on nodes involves 1) isolating a given set of nodes, and 2) plotting a node statistic over time
  • Example would be to analyze how purchasing patterns have changed over time
    • Suppose that we want to run a deep-dive on noi=“customer1” over a set of graphs Gs[.]
    • degs = []
    • for g in Gs : degs.append(len(list(g.neighbors(noi)))) # list() is needed wherever neighbors() returns an iterator
  • Default dictionaries - defaultdict()
    • from collections import defaultdict
    • d = defaultdict(list)
    • d[“heathrow”].append(0.31) ; d[“heathrow”].append(0.84)
    • d # defaultdict(<class 'list'>, {'heathrow': [0.31, 0.84]})
    • The main difference of a defaultdict() vs a dict() is that defaultdict() allows for appending to a key that does not yet exist (the key is created with an empty list first) rather than raising a KeyError
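
The contrast with a plain dictionary can be sketched using only the standard library. The "heathrow" key and values follow the bullets above; the `raised` flag is a hypothetical helper added only to record what happened.

```python
from collections import defaultdict

# A missing key is created on first access, starting from an empty list
d = defaultdict(list)
d["heathrow"].append(0.31)
d["heathrow"].append(0.84)
print(d["heathrow"])

# The same append on a plain dict fails because the key does not exist yet
plain = {}
raised = False
try:
    plain["heathrow"].append(0.31)
except KeyError:
    raised = True
print(raised)
```

One side effect worth keeping in mind is that merely reading a missing key from a defaultdict also creates that key.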

Example code includes:


myPath = "./PythonInputFiles/"


# Dummy up some data with senders and recipients, each being paired to each other at different times
sendList = ["S" + str(x) for x in range(200)]
receiveList = ["R" + str(x) for x in range(200)]

# Allow for 20,000 messages total, pulled at random and with random months
import numpy as np
import pandas as pd

dataSend = np.random.choice(sendList, 20000)
dataReceive = np.random.choice(receiveList, 20000)
dataMonth = np.random.choice(range(4, 11), 20000, p=[0.025, 0.05, 0.075, 0.125, 0.175, 0.225, 0.325])

data = pd.DataFrame( {"sender":dataSend, "recipient":dataReceive, "month":dataMonth} ).drop_duplicates()
print(data.info())
print(data.head())



import networkx as nx

months = range(4, 11)

# Initialize an empty list: Gs
Gs = []

for month in months:
    # Instantiate a new undirected graph: G
    G = nx.Graph()
    
    # Add in all nodes that have ever shown up to the graph
    G.add_nodes_from(data["sender"])
    G.add_nodes_from(data["recipient"])
    
    # Filter the DataFrame so that there's only the given month
    df_filtered = data[data['month'] == month]
    
    # Add edges from filtered DataFrame
    G.add_edges_from(zip(df_filtered["sender"], df_filtered["recipient"]))
    
    # Append G to the list of graphs
    Gs.append(G)

print(len(Gs))


import networkx as nx  
# Instantiate a list of graphs that show edges added: added
added = []
# Instantiate a list of graphs that show edges removed: removed
removed = []
# Here's the fractional change over time
fractional_changes = []
window = 1

for i in range(len(Gs) - window):
    g1 = Gs[i]
    g2 = Gs[i + window]
    
    # Compute graph difference here
    added.append(nx.difference(g2, g1))  
    removed.append(nx.difference(g1, g2))
    
    # Compute change in graph size over time
    fractional_changes.append((len(g2.edges()) - len(g1.edges())) / len(g1.edges()))

# Print the fractional change
print(fractional_changes)


# Import matplotlib
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(111)

# Plot the number of edges added over time
edges_added = [len(g.edges()) for g in added]
plot1 = ax1.plot(edges_added, label='added', color='orange')

# Plot the number of edges removed over time
edges_removed = [len(g.edges()) for g in removed]
plot2 = ax1.plot(edges_removed, label='removed', color='purple')

# Set yscale to logarithmic scale
ax1.set_yscale('log')  
ax1.legend()

# 2nd axes shares x-axis with 1st axes object
ax2 = ax1.twinx()

# Plot the fractional changes over time
plot3 = ax2.plot(fractional_changes, label='fractional change', color='green')

# Here, we create a single legend for both plots
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines1 + lines2, labels1 + labels2, loc=0)
plt.axhline(0, color='green', linestyle='--')
# plt.show()
plt.savefig("_dummyPy082.png", bbox_inches="tight")
plt.clf()


# Import matplotlib
import matplotlib.pyplot as plt

fig = plt.figure()

# Create a list of the number of edges per month
edge_sizes = [len(g.edges()) for g in Gs]

# Plot edge sizes over time
plt.plot(edge_sizes)
plt.xlabel('Time elapsed from first month (in months).') 
plt.ylabel('Number of edges')                           
# plt.show() 
plt.savefig("_dummyPy083.png", bbox_inches="tight")
plt.clf()


# Create a list of degree centrality scores month-by-month
cents = []
for G in Gs:
    cent = nx.degree_centrality(G)
    cents.append(cent)

# Create a function for ECDF (copy from statistics)
def ECDF(myData):
    """Compute ECDF for a one-dimensional array of measurements."""
    
    # Number of data points: n
    n = len(myData)
    
    # x-data for the ECDF: x
    x = np.sort(myData)
    
    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n
    
    return x, y


# Plot ECDFs over time
fig = plt.figure()
for i in range(len(cents)):
    x, y = ECDF(list(cents[i].values()))
    plt.plot(x, y, label='Month {0}'.format(i+1)) 

plt.legend()   
# plt.show()
plt.savefig("_dummyPy084.png", bbox_inches="tight")
plt.clf()


# Get the top 5 unique degree centrality scores: top_dcs
# (G is the final month's graph, left over from the loop above)
top_dcs = sorted(set(nx.degree_centrality(G).values()), reverse=True)[0:5]

# Create list of nodes that have the top 5 highest overall degree centralities
top_connected = []
for n, dc in nx.degree_centrality(G).items():
    if dc in top_dcs:
        top_connected.append(n)

# Print the number of nodes that share the top 5 degree centrality scores
print(len(top_connected))


# Import necessary modules
import matplotlib.pyplot as plt
from collections import defaultdict

# Create a defaultdict in which the keys are nodes and the values are a list of connectivity scores over time
connectivity = defaultdict(list)
for n in top_connected:
    for g in Gs:
        connectivity[n].append(len(list(g.neighbors(n))))

# Plot the connectivity for each node
fig = plt.figure() 
for n, conn in connectivity.items(): 
    plt.plot(conn, label=n) 

plt.legend()  
# plt.show()
plt.savefig("_dummyPy085.png", bbox_inches="tight")
plt.clf()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 19033 entries, 0 to 19999
## Data columns (total 3 columns):
## month        19033 non-null int32
## recipient    19033 non-null object
## sender       19033 non-null object
## dtypes: int32(1), object(2)
## memory usage: 371.7+ KB
## None
##    month recipient sender
## 0      9      R157    S62
## 1      9       R82    S75
## 2      7        R1    S75
## 3     10      R186   S176
## 4     10       R34    S64
## 7
## [1.1483050847457628, 0.47337278106508873, 0.5957161981258366, 0.40226510067114096, 0.26174095124139996, 0.4480796586059744]
## 27

Evolution of Edges Added and Deleted:

Evolution of Total Number of Edges:

Evolution of ECDF for Degree Centrality:

Evolution of Edges for Most Connected Users:


Chapter 4 - Case Study (Tying Together)

Introduction to the dataset - 6 months worth of student postings to college forums:

  • The nodes are bipartite (students and forums), with edges existing when a student posts to a forum
    • Constructing a graph from a pandas DataFrame
    • Constructing unipartite projections of a bipartite graph
    • Visualization using CircosPlot
    • Time series filtering and analysis
  • Graphs from DataFrames - G = nx.Graph() ; G.add_nodes_from(df[“products”], bipartite=“products”)
    • G.add_edges_from(zip(df[“customers”], df[“products”]))
  • Bipartite projections to unipartite - cust_nodes = [n for n, d in G.nodes(data=True) if d[“bipartite”] == “customers”]
    • nx.bipartite.projected_graph(G, nodes=cust_nodes)

Time-based filtering - including the datetime module:

  • Filtering edges - for example, [(u, v) for u, v, d in G.edges(data=True) if d[“sale count”] >= 10]
  • Using datetime - from datetime import datetime, timedelta
    • date1 = datetime(2011, 11, 6) # creates the datetime 2011-11-06 00:00:00
  • Plotting - c = CircosPlot(G, node_grouping=“bipartite”, node_color=“bipartite”) ; c.draw() ; plt.show()
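
Date-based edge filtering can be sketched with the standard library alone. The example assumes a hypothetical list of (u, v, metadata) tuples shaped like the output of G.edges(data=True); the student and forum names are made up.

```python
from datetime import datetime

# Hypothetical edges, shaped like G.edges(data=True)
edges = [
    ("s1", "f1", {"date": datetime(2004, 5, 14)}),
    ("s2", "f1", {"date": datetime(2004, 5, 15)}),
    ("s1", "f2", {"date": datetime(2004, 5, 20)}),
]

# Keep only the edges created before the cutoff date
cutoff = datetime(2004, 5, 16)
early = [(u, v) for u, v, d in edges if d["date"] < cutoff]
print(early)
```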

Time series analysis - global vs. local (point in time) analysis:

  • The timedelta object from datetime represents a duration, such as the time between two datetime objects, and can be added to or subtracted from a datetime
    • td = timedelta(4) # 4 days ; datetime(2011, 11, 10) + td returns datetime.datetime(2011, 11, 14, 0, 0)
  • Degree centrality is nx.degree_centrality(G)
    • For the bipartite (denominator only the OTHER partition), use nx.bipartite.degree_centrality(G, nodeType1List)
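
The timedelta arithmetic, and the sliding-window loop it enables, can be sketched as follows. Only the standard library is used; the window dates are hypothetical and chosen to keep the loop short.

```python
from datetime import datetime, timedelta

# Adding a duration to a datetime
start = datetime(2011, 11, 10)
td = timedelta(days=4)
end = start + td
print(end)

# Subtracting two datetimes gives a timedelta
gap = datetime(2011, 11, 14) - datetime(2011, 11, 6)
print(gap.days)

# Stepping through time in fixed-size windows
window_starts = []
curr_day = datetime(2004, 5, 14)
last_day = datetime(2004, 5, 20)
step = timedelta(days=2)
while curr_day < last_day:
    window_starts.append(curr_day)
    curr_day += step
print(len(window_starts))
```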

Example code includes:


myPath = "./PythonInputFiles/"



import pickle
import pandas as pd
import networkx as nx


# Downloaded key file "uci-forum.p" from DataCamp
with open(myPath + "uci-forum.p", "rb") as pFile: tmpGraph = pickle.load(pFile)

# Convert to a DataFrame with student-forum-date
# Grab the student-forum-date matches from the edges
edgeData = tmpGraph.edges(data=True)
studentData = [x[0] if x[0][0]=="s" else x[1] for x in edgeData]
forumData = [x[1] if x[0][0]=="s" else x[0] for x in edgeData]
dateData = pd.to_datetime([x[2]["date"] for x in edgeData])

# Check that the student data matches the nodes with bipartite="student"
studentNodes = set([n for n, d in tmpGraph.nodes(data=True) if d["bipartite"]=="student"])
forumNodes = set([n for n, d in tmpGraph.nodes(data=True) if d["bipartite"]=="forum"])

studentNodes.difference(set(studentData))
forumNodes.difference(set(forumData))
set(studentData).difference(studentNodes)
set(forumData).difference(forumNodes)

# Create the DataFrame
data = pd.DataFrame( {"student":studentData, "forum":forumData, "date":dateData} )
data.info()
data.head(10)




# Instantiate a new Graph: G
G = nx.Graph()

# Add nodes from each of the partitions
G.add_nodes_from(data["student"], bipartite="student")
G.add_nodes_from(data["forum"], bipartite="forum")

# Add in each edge along with the date the edge was created
for r, d in data.iterrows():
    G.add_edge(d["student"], d["forum"], date=d["date"])


# Import necessary modules
import matplotlib.pyplot as plt
import networkx as nx

# Get the student partition's nodes: student_nodes
student_nodes = [n for n, d in G.nodes(data=True) if d['bipartite'] == 'student']

# Create the students nodes projection as a graph: G_students
G_students = nx.bipartite.projected_graph(G, nodes=student_nodes)

# Calculate the degree centrality using nx.degree_centrality: dcs
dcs = nx.degree_centrality(G_students)

# Plot the histogram of degree centrality values
plt.hist(list(dcs.values()))
plt.yscale('log')  
# plt.show() 
plt.savefig("_dummyPy086.png", bbox_inches="tight")
plt.clf()


# Import necessary modules
import matplotlib.pyplot as plt 
import networkx as nx

# Get the forums partition's nodes: forum_nodes
forum_nodes = [n for n, d in G.nodes(data=True) if d["bipartite"] == "forum"]

# Create the forum nodes projection as a graph: G_forum
G_forum = nx.bipartite.projected_graph(G, forum_nodes)

# Calculate the degree centrality using nx.degree_centrality: dcs
dcs = nx.degree_centrality(G_forum)

# Plot the histogram of degree centrality values
plt.hist(list(dcs.values()))
plt.yscale('log') 
# plt.show()  
plt.savefig("_dummyPy087.png", bbox_inches="tight")
plt.clf()


import networkx as nx
from datetime import datetime

# Instantiate a new graph: G_sub
G_sub = nx.Graph()

# Add nodes from the original graph
G_sub.add_nodes_from(G.nodes(data=True))

# Add edges using a list comprehension with one conditional on the edge dates, that the date of the edge is earlier than 2004-05-16.
G_sub.add_edges_from([(u, v, d) for u, v, d in G.edges(data=True) if d['date'] < datetime(2004, 5, 16)])


# Import necessary modules
from nxviz import CircosPlot
import networkx as nx
import matplotlib.pyplot as plt

# Compute degree centrality scores of each node
dcs = nx.bipartite.degree_centrality(G, nodes=forum_nodes)
for n, d in G_sub.nodes(data=True):
    d['dc'] = dcs[n]

# Create the CircosPlot object: c
c = CircosPlot(G_sub, node_color = "bipartite", node_grouping="bipartite", node_order="dc")

# Draw c to screen
c.draw()

# Display the plot
# plt.show() 
plt.savefig("_dummyPy088.png", bbox_inches="tight")
plt.clf()


from datetime import datetime
dayone = datetime(2004, 5, 14, 0, 0)
lastday = datetime(2004, 10, 26, 0, 0)

# Import necessary modules
from datetime import timedelta  
import matplotlib.pyplot as plt

# Define current day and timedelta of 2 days
curr_day = dayone
td = timedelta(days=2)

# Initialize an empty list of posts by day
n_posts = []
while curr_day < lastday:
    if curr_day.day == 1:
        print(curr_day) 
    # Filter edges such that they are within the sliding time window: edges
    edges = [(u, v, d) for u, v, d in G.edges(data=True) if d['date'] >= curr_day and d['date'] < curr_day + td]
    
    # Append number of edges to the n_posts list
    n_posts.append(len(edges))
    
    # Increment the curr_day by the time delta
    curr_day += td

# Create the plot
plt.plot(n_posts)  
plt.xlabel('Days elapsed')
plt.ylabel('Number of posts')
# plt.show()  
plt.savefig("_dummyPy089.png", bbox_inches="tight")
plt.clf()


from datetime import datetime, timedelta
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# Initialize a new list: mean_dcs
mean_dcs = []
curr_day = dayone
td = timedelta(days=2)

while curr_day < lastday:
    if curr_day.day == 1:
        print(curr_day)  
    # Instantiate a new graph containing a subset of edges: G_sub
    G_sub = nx.Graph()
    # Add nodes from G
    G_sub.add_nodes_from(G.nodes(data=True))
    # Add in edges that fulfill the criteria
    G_sub.add_edges_from([(u, v, d) for u, v, d in G.edges(data=True) if d['date'] >= curr_day and d['date'] < curr_day + td])
    
    # Get the students projection
    G_student_sub = nx.bipartite.projected_graph(G_sub, student_nodes)
    # Compute the degree centrality of the students projection
    dc = nx.degree_centrality(G_student_sub)
    # Append mean degree centrality to the list mean_dcs
    mean_dcs.append(np.mean(list(dc.values())))
    # Increment the time
    curr_day += td

plt.plot(mean_dcs)
plt.xlabel('Time elapsed')
plt.ylabel('Degree centrality.')
# plt.show()
plt.savefig("_dummyPy090.png", bbox_inches="tight")
plt.clf()


# Import necessary modules
from datetime import timedelta
import networkx as nx
import matplotlib.pyplot as plt

# Instantiate a list to hold the list of most popular forums by day: most_popular_forums
most_popular_forums = []
# Instantiate a list to hold the degree centrality scores of the most popular forums: highest_dcs
highest_dcs = []
curr_day = dayone  
td = timedelta(days=1)  

while curr_day < lastday:  
    if curr_day.day == 1: 
        print(curr_day) 
    # Instantiate new graph: G_sub
    G_sub = nx.Graph()
    
    # Add in nodes from original graph G
    G_sub.add_nodes_from(G.nodes(data=True))
    
    # Add in edges from the original graph G that fulfill the criteria
    G_sub.add_edges_from([(u, v, d) for u, v, d in G.edges(data=True) if d['date'] >= curr_day and d['date'] < curr_day + td])
    
    # CODE CONTINUES ON NEXT EXERCISE
    curr_day += td


# Import necessary modules
from datetime import timedelta
import networkx as nx
import matplotlib.pyplot as plt

most_popular_forums = []
highest_dcs = []
curr_day = dayone 
td = timedelta(days=1)  

while curr_day < lastday:  
    if curr_day.day == 1:  
        print(curr_day)  
    G_sub = nx.Graph()
    G_sub.add_nodes_from(G.nodes(data=True))   
    G_sub.add_edges_from([(u, v, d) for u, v, d in G.edges(data=True) if d['date'] >= curr_day and d['date'] < curr_day + td])
    
    # Get the degree centrality 
    dc = nx.bipartite.degree_centrality(G_sub, forum_nodes)
    # Filter the dictionary such that there's only forum degree centralities
    forum_dcs = {n:dc for n, dc in dc.items() if n in forum_nodes}
    # Identify the most popular forum(s) 
    most_popular_forum = [n for n, dc in forum_dcs.items() if dc == max(forum_dcs.values()) and dc != 0] 
    most_popular_forums.append(most_popular_forum)
    # Store the highest dc values in highest_dcs
    highest_dcs.append(max(forum_dcs.values()))
    
    curr_day += td  

plt.figure(1) 
plt.plot([len(forums) for forums in most_popular_forums], color='blue', label='Forums')
plt.ylabel('Number of Most Popular Forums')
# plt.show()
plt.savefig("_dummyPy091.png", bbox_inches="tight")
plt.clf()


plt.figure(2)
plt.plot(highest_dcs, color='orange', label='DC Score')
plt.ylabel('Top Degree Centrality Score')
# plt.show()
plt.savefig("_dummyPy092.png", bbox_inches="tight")
plt.clf()
## -c:12: FutureWarning: The pandas.tslib module is deprecated and will be removed in a future version.
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 7089 entries, 0 to 7088
## Data columns (total 3 columns):
## date       7089 non-null datetime64[ns]
## forum      7089 non-null object
## student    7089 non-null object
## dtypes: datetime64[ns](1), object(2)
## memory usage: 110.8+ KB
## 2004-06-01 00:00:00
## 2004-07-01 00:00:00
## 2004-09-01 00:00:00
## 2004-10-01 00:00:00
## 2004-06-01 00:00:00
## 2004-07-01 00:00:00
## 2004-09-01 00:00:00
## 2004-10-01 00:00:00
## 2004-06-01 00:00:00
## 2004-07-01 00:00:00
## 2004-08-01 00:00:00
## 2004-09-01 00:00:00
## 2004-10-01 00:00:00
## 2004-06-01 00:00:00
## 2004-07-01 00:00:00
## 2004-08-01 00:00:00
## 2004-09-01 00:00:00
## 2004-10-01 00:00:00

Degree Centrality (Student):

Degree Centrality (Forum):

Circos Plot of Student-Forum Database:

Evolution of Posting:

Evolution of Degree Centrality:

Most Popular Forum Concentrations:

Most Connected User Concentrations:

Python Import and Clean Data

Importing Data in Python (Part I)

Chapter 1 - Introduction and flat files

Welcome to the course - importing from 1) flat files, 2) other native data, and 3) relational databases:

  • Begin by looking at text files - plain text and table data (each row is an observation)
  • The python “open()” function is the easiest way to look at a file
    • filename = “myFile” ; fPointer = open(filename, mode=“r”) ; fText = fPointer.read() ; fPointer.close()
    • print(fText) # All the text will be printed to the console
  • Alternately, can use with open(“myFile”, mode=“r”) as fPointer: # the file will close when the with block ends
    • The “with” statement is known as a context manager
    • The use of a context manager is a best practice, since you never have to worry about closing a file
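
The two ways of reading a file can be compared side by side. This sketch writes a throwaway file first so that it is self-contained; the file name example.txt and its contents are hypothetical.

```python
import os
import tempfile

# Create a small throwaway file to read back
tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, "example.txt")
with open(filename, mode="w") as f_out:
    f_out.write("line one\nline two\n")

# Manual approach: the file must be closed explicitly
f_pointer = open(filename, mode="r")
text_manual = f_pointer.read()
f_pointer.close()

# Context manager: the file is closed when the with block ends
with open(filename, mode="r") as f_pointer:
    first_line = f_pointer.readline()

print(f_pointer.closed)
print(repr(first_line))
```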

The importance of flat files in data science:

  • Flat files are text files containing records (which is to say “table data” with each row being an observation and each column being an attribute)
  • Flat files may also have a header describing the columns of the data (important to know for the data import process)
  • Flat files are especially relevant for data science since they are a nice way to store tidy data
  • Flat files may be separated by delimiters (comma, tab, etc.)
  • Imports may be done through numpy or pandas

Importing flat files using numpy (only for data that is purely numerical):

  • numpy arrays are the Python standard for storing numerical data; efficient, fast, and clean, and also often essential for other packages
  • numpy.loadtxt() - import numpy as np; myData=np.loadtxt(“myFile”, delimiter=, skiprows=0, usecols=myList, dtype=) # default delimiter is any whitespace, default skip-rows is 0, default usecols is ALL, dtype=str will load as strings
    • Tends to break down when loading mixed data types; these are typically better for pandas
  • numpy.genfromtxt() is another option, though only briefly mentioned in this course
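
The main np.loadtxt() arguments can be tried on an in-memory file. This sketch assumes NumPy is installed and uses io.StringIO in place of a file on disk; the column names and values are hypothetical.

```python
import io
import numpy as np

# Hypothetical tab-delimited content with a single header row
raw = "time\tpct\textra\n0.0\t1.5\t9\n5.0\t2.5\t8\n"

# dtype=str keeps everything as text, so the header row loads too
as_str = np.loadtxt(io.StringIO(raw), delimiter="\t", dtype=str)
print(as_str[0])

# Skip the header and keep only the first and third columns as floats
as_float = np.loadtxt(io.StringIO(raw), delimiter="\t", skiprows=1, usecols=[0, 2])
print(as_float)
```

Without skiprows=1 the float load would fail on the header row, which is the typical way mixed data types trip up np.loadtxt().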

Importing flat files using pandas - create 2-D data structures with columns of different data types:

  • The pandas package is designed to help elevate Python from data munging (where it has always been excellent) to the full data analysis workflow (which might otherwise require R)
  • The pandas DataFrame is modeled off the data frame in R; same idea of observations (rows) and variables (columns)
  • The pandas package is the current best practice for loading data from flat files into Python
  • In the most basic usage, myData = pd.read_csv(“myFile”) # assumes import pandas as pd called previously
    • myData.head() # shows the first 5 rows of the data
    • myData.values # This will be the associated numpy array
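
The basic pandas workflow can be sketched on an in-memory CSV. This example assumes pandas is installed and uses io.StringIO in place of a file on disk; the three-column table is hypothetical and is not the titanic data used below.

```python
import io
import pandas as pd

# Hypothetical comma-delimited content with a header row and mixed types
raw = "name,age,fare\nAnn,29,7.25\nBob,35,71.28\n"

# Each column gets its own data type
df = pd.read_csv(io.StringIO(raw))
print(df.head())
print(df.dtypes)

# The underlying numpy array
arr = df.values
print(type(arr))
```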

Example code includes:


# put in directory ./PythonInputFiles/
# moby_dick.txt (converted to romeo-full.txt)
# digits.csv (using mnist_test.csv)
# digits_header.txt (skipped)
# seaslug.txt (downloaded)
# titanic.csv (converted from R)
# titanic_corrupt.txt (skipped)

myPath = "./PythonInputFiles/"


# NEED FILE "moby_dick.txt" (used "romeo-full.txt" instead)
# Open a file: file
file = open(myPath + "romeo-full.txt", mode="r")

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)


# Read & print the first 3 lines
with open(myPath + "romeo-full.txt") as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())


# NEED DIGIT RECOGNITION SITE - see http://yann.lecun.com/exdb/mnist/
# Import package
import numpy as np

# Assign filename to variable: file
file = myPath + 'mnist_test.csv'

# Load file as array: digits
digits = np.loadtxt(file, delimiter=",")

# Print datatype of digits
print(type(digits))

# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))


import matplotlib.pyplot as plt  # so the plotting below can be done

# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
# plt.show()
plt.savefig("_dummyPy042.png", bbox_inches="tight")
plt.clf()

# File should be tab-delimited and with a header row (for the skiprows=1)
# Assign the filename: file
# file = 'digits_header.txt'

# Load the data: data
# data = np.loadtxt(file, delimiter="\t", skiprows=1, usecols=[0, 2])

# Print data
# print(data)


# NEED FILE FROM http://www.stat.ucla.edu/projects/datasets/seaslug-explanation.html
# Should be floats with a single text header row, and tab-delimited

# Assign filename: file
file = myPath + 'seaslug.txt'

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the first element of data
print(data[0])

# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter="\t", dtype=float, skiprows=1)

# Print the 10th element of data_float
print(data_float[9])

# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
# plt.show()
plt.savefig("_dummyPy043.png", bbox_inches="tight")
plt.clf()

# NEED FILE "titanic.csv"
# Idea is that np.genfromtxt() and np.recfromcsv() can accept mixed data types by returning a structured array (each row is a record with one typed field per column); dtype=None lets NumPy infer the data type of each column

# Assign the filename: file
# file = myPath + 'titanic.csv'

# Import file using np.recfromcsv: d
# d=np.recfromcsv(file)   # This is like np.genfromtxt() with defaults set to dtype=None, delimiter=",", names=True

# Print out first three entries of d
# print(d[:3])


# PassengerId-Survived-Pclass-Sex-Age-SibSp-Parch-Ticket-Fare-Cabin-Embarked
# Import pandas as pd
import pandas as pd

# Assign the filename: file
file = myPath + 'titanic.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file)

# View the head of the DataFrame
print(df.head())



# Assign the filename: file
file = myPath + 'mnist_test.csv'

# Read the first 5 rows of the file into a DataFrame: data
data=pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame: data_array
data_array = data.values

# Print the datatype of data_array to the shell
print(type(data_array))


# Assign filename: file
# file = 'titanic_corrupt.txt'

# Import file: data
# data = pd.read_csv(file, sep="\t", comment="#", na_values=["Nothing"])

# Print the head of the DataFrame
# print(data.head())

# Plot 'Age' variable in a histogram
# pd.DataFrame.hist(data[['Age']])
# plt.xlabel('Age (years)')
# plt.ylabel('count')
# plt.show()
## Romeo and Juliet
## Act 2, Scene 2 
## 
## SCENE II. Capulet's orchard.
## 
## Enter ROMEO
## 
## ROMEO
## 
## He jests at scars that never felt a wound.
## JULIET appears above at a window
## 
## But, soft! what light through yonder window breaks?
## It is the east, and Juliet is the sun.
## Arise, fair sun, and kill the envious moon,
## Who is already sick and pale with grief,
## That thou her maid art far more fair than she:
## Be not her maid, since she is envious;
## Her vestal livery is but sick and green
## And none but fools do wear it; cast it off.
## It is my lady, O, it is my love!
## O, that she knew she were!
## She speaks yet she says nothing: what of that?
## Her eye discourses; I will answer it.
## I am too bold, 'tis not to me she speaks:
## Two of the fairest stars in all the heaven,
## Having some business, do entreat her eyes
## To twinkle in their spheres till they return.
## What if her eyes were there, they in her head?
## The brightness of her cheek would shame those stars,
## As daylight doth a lamp; her eyes in heaven
## Would through the airy region stream so bright
## That birds would sing and think it were not night.
## See, how she leans her cheek upon her hand!
## O, that I were a glove upon that hand,
## That I might touch that cheek!
## 
## JULIET
## 
## Ay me!
## 
## ROMEO
## 
## She speaks:
## O, speak again, bright angel! for thou art
## As glorious to this night, being o'er my head
## As is a winged messenger of heaven
## Unto the white-upturned wondering eyes
## Of mortals that fall back to gaze on him
## When he bestrides the lazy-pacing clouds
## And sails upon the bosom of the air.
## 
## JULIET
## 
## O Romeo, Romeo! wherefore art thou Romeo?
## Deny thy father and refuse thy name;
## Or, if thou wilt not, be but sworn my love,
## And I'll no longer be a Capulet.
## 
## ROMEO
## 
## [Aside] Shall I hear more, or shall I speak at this?
## 
## JULIET
## 
## 'Tis but thy name that is my enemy;
## Thou art thyself, though not a Montague.
## What's Montague? it is nor hand, nor foot,
## Nor arm, nor face, nor any other part
## Belonging to a man. O, be some other name!
## What's in a name? that which we call a rose
## By any other name would smell as sweet;
## So Romeo would, were he not Romeo call'd,
## Retain that dear perfection which he owes
## Without that title. Romeo, doff thy name,
## And for that name which is no part of thee
## Take all myself.
## 
## ROMEO
## 
## I take thee at thy word:
## Call me but love, and I'll be new baptized;
## Henceforth I never will be Romeo.
## 
## JULIET
## 
## What man art thou that thus bescreen'd in night
## So stumblest on my counsel?
## 
## ROMEO
## 
## By a name
## I know not how to tell thee who I am:
## My name, dear saint, is hateful to myself,
## Because it is an enemy to thee;
## Had I it written, I would tear the word.
## 
## JULIET
## 
## My ears have not yet drunk a hundred words
## Of that tongue's utterance, yet I know the sound:
## Art thou not Romeo and a Montague?
## 
## ROMEO
## 
## Neither, fair saint, if either thee dislike.
## 
## JULIET
## 
## How camest thou hither, tell me, and wherefore?
## The orchard walls are high and hard to climb,
## And the place death, considering who thou art,
## If any of my kinsmen find thee here.
## 
## ROMEO
## 
## With love's light wings did I o'er-perch these walls;
## For stony limits cannot hold love out,
## And what love can do that dares love attempt;
## Therefore thy kinsmen are no let to me.
## 
## JULIET
## 
## If they do see thee, they will murder thee.
## 
## ROMEO
## 
## Alack, there lies more peril in thine eye
## Than twenty of their swords: look thou but sweet,
## And I am proof against their enmity.
## 
## JULIET
## 
## I would not for the world they saw thee here.
## 
## ROMEO
## 
## I have night's cloak to hide me from their sight;
## And but thou love me, let them find me here:
## My life were better ended by their hate,
## Than death prorogued, wanting of thy love.
## 
## JULIET
## 
## By whose direction found'st thou out this place?
## 
## ROMEO
## 
## By love, who first did prompt me to inquire;
## He lent me counsel and I lent him eyes.
## I am no pilot; yet, wert thou as far
## As that vast shore wash'd with the farthest sea,
## I would adventure for such merchandise.
## 
## JULIET
## 
## Thou know'st the mask of night is on my face,
## Else would a maiden blush bepaint my cheek
## For that which thou hast heard me speak to-night
## Fain would I dwell on form, fain, fain deny
## What I have spoke: but farewell compliment!
## Dost thou love me? I know thou wilt say 'Ay,'
## And I will take thy word: yet if thou swear'st,
## Thou mayst prove false; at lovers' perjuries
## Then say, Jove laughs. O gentle Romeo,
## If thou dost love, pronounce it faithfully:
## Or if thou think'st I am too quickly won,
## I'll frown and be perverse an say thee nay,
## So thou wilt woo; but else, not for the world.
## In truth, fair Montague, I am too fond,
## And therefore thou mayst think my 'havior light:
## But trust me, gentleman, I'll prove more true
## Than those that have more cunning to be strange.
## I should have been more strange, I must confess,
## But that thou overheard'st, ere I was ware,
## My true love's passion: therefore pardon me,
## And not impute this yielding to light love,
## Which the dark night hath so discovered.
## 
## ROMEO
## 
## Lady, by yonder blessed moon I swear
## That tips with silver all these fruit-tree tops--
## 
## JULIET
## 
## O, swear not by the moon, the inconstant moon,
## That monthly changes in her circled orb,
## Lest that thy love prove likewise variable.
## 
## ROMEO
## 
## What shall I swear by?
## 
## JULIET
## 
## Do not swear at all;
## Or, if thou wilt, swear by thy gracious self,
## Which is the god of my idolatry,
## And I'll believe thee.
## 
## ROMEO
## 
## If my heart's dear love--
## 
## JULIET
## 
## Well, do not swear: although I joy in thee,
## I have no joy of this contract to-night:
## It is too rash, too unadvised, too sudden;
## Too like the lightning, which doth cease to be
## Ere one can say 'It lightens.' Sweet, good night!
## This bud of love, by summer's ripening breath,
## May prove a beauteous flower when next we meet.
## Good night, good night! as sweet repose and rest
## Come to thy heart as that within my breast!
## 
## ROMEO
## 
## O, wilt thou leave me so unsatisfied?
## 
## JULIET
## 
## What satisfaction canst thou have to-night?
## 
## ROMEO
## 
## The exchange of thy love's faithful vow for mine.
## 
## JULIET
## 
## I gave thee mine before thou didst request it:
## And yet I would it were to give again.
## 
## ROMEO
## 
## Wouldst thou withdraw it? for what purpose, love?
## 
## JULIET
## 
## But to be frank, and give it thee again.
## And yet I wish but for the thing I have:
## My bounty is as boundless as the sea,
## My love as deep; the more I give to thee,
## The more I have, for both are infinite.
## 
## Nurse calls within
## 
## I hear some noise within; dear love, adieu!
## Anon, good nurse! Sweet Montague, be true.
## Stay but a little, I will come again.
## Exit, above
## 
## ROMEO
## 
## O blessed, blessed night! I am afeard.
## Being in night, all this is but a dream,
## Too flattering-sweet to be substantial.
## 
## Re-enter JULIET, above
## 
## JULIET
## 
## Three words, dear Romeo, and good night indeed.
## If that thy bent of love be honourable,
## Thy purpose marriage, send me word to-morrow,
## By one that I'll procure to come to thee,
## Where and what time thou wilt perform the rite;
## And all my fortunes at thy foot I'll lay
## And follow thee my lord throughout the world.
## 
## Nurse
## 
## [Within] Madam!
## 
## JULIET
## 
## I come, anon.--But if thou mean'st not well,
## I do beseech thee--
## 
## Nurse
## [Within] Madam!
## 
## JULIET
## 
## By and by, I come:--
## To cease thy suit, and leave me to my grief:
## To-morrow will I send.
## 
## ROMEO
## 
## So thrive my soul--
## 
## JULIET
## 
## A thousand times good night!
## Exit, above
## 
## ROMEO
## 
## A thousand times the worse, to want thy light.
## Love goes toward love, as schoolboys from
## their books,
## But love from love, toward school with heavy looks.
## Retiring
## 
## Re-enter JULIET, above
## 
## JULIET
## 
## Hist! Romeo, hist! O, for a falconer's voice,
## To lure this tassel-gentle back again!
## Bondage is hoarse, and may not speak aloud;
## Else would I tear the cave where Echo lies,
## And make her airy tongue more hoarse than mine,
## With repetition of my Romeo's name.
## 
## ROMEO
## 
## It is my soul that calls upon my name:
## How silver-sweet sound lovers' tongues by night,
## Like softest music to attending ears!
## 
## JULIET
## 
## Romeo!
## 
## ROMEO
## 
## My dear?
## 
## JULIET
## 
## At what o'clock to-morrow
## Shall I send to thee?
## 
## ROMEO
## 
## At the hour of nine.
## 
## JULIET
## 
## I will not fail: 'tis twenty years till then.
## I have forgot why I did call thee back.
## 
## ROMEO
## 
## Let me stand here till thou remember it.
## 
## JULIET
## 
## I shall forget, to have thee still stand there,
## Remembering how I love thy company.
## 
## ROMEO
## 
## And I'll still stay, to have thee still forget,
## Forgetting any other home but this.
## 
## JULIET
## 
## 'Tis almost morning; I would have thee gone:
## And yet no further than a wanton's bird;
## Who lets it hop a little from her hand,
## Like a poor prisoner in his twisted gyves,
## And with a silk thread plucks it back again,
## So loving-jealous of his liberty.
## 
## ROMEO
## 
## I would I were thy bird.
## 
## JULIET
## 
## Sweet, so would I:
## Yet I should kill thee with much cherishing.
## Good night, good night! parting is such
## sweet sorrow,
## That I shall say good night till it be morrow.
## 
## Exit above
## 
## ROMEO
## 
## Sleep dwell upon thine eyes, peace in thy breast!
## Would I were sleep and peace, so sweet to rest!
## Hence will I to my ghostly father's cell,
## His help to crave, and my dear hap to tell.
## 
## Exit
## 
## False
## True
## Romeo and Juliet
## 
## Act 2, Scene 2 
## 
## 
## 
## <class 'numpy.ndarray'>
## ["b'Time'" "b'Percent'"]
## [ 0.     0.357]
##    Unnamed: 0  PassengerId  Survived  Pclass  \
## 0           1            1         0       3   
## 1           2            2         1       1   
## 2           3            3         1       3   
## 3           4            4         1       1   
## 4           5            5         0       3   
## 
##                                                 Name     Sex   Age  SibSp  \
## 0                            Braund, Mr. Owen Harris    male  22.0      1   
## 1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
## 2                             Heikkinen, Miss. Laina  female  26.0      0   
## 3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
## 4                           Allen, Mr. William Henry    male  35.0      0   
## 
##    Parch            Ticket     Fare Cabin Embarked  
## 0      0         A/5 21171   7.2500   NaN        S  
## 1      0          PC 17599  71.2833   C85        C  
## 2      0  STON/O2. 3101282   7.9250   NaN        S  
## 3      0            113803  53.1000  C123        S  
## 4      0            373450   8.0500   NaN        S  
## <class 'numpy.ndarray'>

Example Image Recognition Digit:

Sea Slug Data:


Chapter 2 - Importing data from other file types

Introduction to other file types - Excel spreadsheets, MATLAB, SAS, Stata, HDF5 (becoming a more relevant format for saving data):

  • There are also “pickled” files, which are native to Python; the idea is that you can serialize Python objects such as dictionaries or lists for later use in Python (rather than using JSON, which is more human-readable)
  • Opening a pickled file: import pickle; with open(“myFile.pkl”, mode=“rb”) as file: data=pickle.load(file)
  • Excel files can generally be opened using data=pd.ExcelFile(“myExcel.xlsx”) # assumes previous import pandas as pd; returns an ExcelFile object, from which individual sheets are parsed into data frames
    • data.sheet_names # provides a list of the sheet names
    • df1 = data.parse(“sheetName”) # can pass either the sheet index as an integer or the sheet name as a string
    • Can also skip rows and import only certain columns
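
A minimal sketch of the pickle round trip described in the bullets above. The dictionary contents and the temporary file location are made up for illustration; any picklable Python object could be used instead.

```python
import os
import pickle
import tempfile

# Hypothetical data to serialize (any picklable Python object would do)
my_dict = {"Mar": "84.4", "June": "69.4"}

# Work in a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "myFile.pkl")

    # Pickle files are binary, hence mode="wb" for writing ...
    with open(pkl_path, mode="wb") as file:
        pickle.dump(my_dict, file)

    # ... and mode="rb" for reading
    with open(pkl_path, mode="rb") as file:
        data = pickle.load(file)

print(data)
print(type(data))
```

Unlike JSON, the pickled file is not human-readable, but the object comes back with its Python type intact.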

Importing SAS/Stata files using pandas:

  • SAS: Statistical Analysis System is common for business analytics and biostatistics
  • Stata: Statistics + Data is common for academic social sciences research
  • The most common SAS files have the extensions .sas7bdat (dataset files) and .sas7bcat (catalog files)
    • from sas7bdat import SAS7BDAT
    • with SAS7BDAT(“mySASfile.sas7bdat”) as file: df_sas=file.to_data_frame() # as per previous examples
  • The Stata files can be imported directly using pd
    • pd.read_stata(“myStataFile.dta”)

Importing HDF5 (Hierarchical Data Format 5) files, quickly becoming the Python standard for storing large quantities of numerical data:

  • HDF5 can scale up to exabytes of data, and is commonly used for files of hundreds of gigabytes or even terabytes
  • import h5py; data=h5py.File(“myHD5.hd5”, “r”); for key in data.keys(): print(key)
    • might have “meta”, “quality”, and “strain” for a specific LIGO data file
    • could dive further in to any of the keys, for example for key in data[“meta”].keys(): print(key)
  • The HDF project is formally managed by The HDF Group, a Champaign-based spinoff of the University of Illinois

Importing MATLAB (MATrix LABoratory) files - industry standard in engineering and science:

  • The library scipy has functions scipy.io.loadmat() and scipy.io.savemat()
    • The loaded file will be a dictionary (keys are the variable names and values are the objects assigned to the variables)
  • COULD NOT GET scipy to import (lack of blas???)

Example code includes:


myPath = "./PythonInputFiles/"

# Import pickle package
import pickle

# NEED PICKLE DATA - {'Mar': '84.4', 'June': '69.4', 'Airline': '8', 'Aug': '85'}
# Created using with open(myPath + "data.pkl", "wb") as file: pickle.dump(myDict, file)
# Open pickle file and load data: d
with open(myPath + 'data.pkl', mode="rb") as file:
    d = pickle.load(file)

# Print d
print(d)

# Print datatype of d
print(type(d))


# NEED BATTLE DEATHS DATA - https://www.prio.org/Data/Armed-Conflict/Battle-Deaths/The-Battle-Deaths-Dataset-version-30/ (downloaded and converted name to "battledeath.xlsx")
# Import pandas
import pandas as pd

# Assign spreadsheet filename: file
file = myPath + "battledeath.xlsx"

# Load spreadsheet: xl
xl = pd.ExcelFile(file)

# Print sheet names
print(xl.sheet_names)


# Load a sheet into a DataFrame by name: df1
# There is only one sheet absent converting "bdonly" to a file by year
df1 = xl.parse("bdonly")

# Print the head of the DataFrame df1
print(df1.head())

# Load a sheet into a DataFrame by index: df2
df2 = xl.parse(0)

# Print the head of the DataFrame df2
print(df2.head())


# Parse the first sheet and rename the columns: df1
# Note: later pandas versions removed parse_cols; the equivalent argument there is usecols
df1 = xl.parse(0, skiprows=[0], parse_cols=[2, 9], names=["AAM due to War (2002)", "Country"])

# Print the head of the DataFrame df1
print(df1.head())

# Parse the tenth column of the first sheet and rename the column: df2
df2 = xl.parse(0, parse_cols=[9], skiprows=[0], names=["Country"])

# Print the head of the DataFrame df2
print(df2.head())


# DO NOT HAVE THIS FILE EITHER
# Import sas7bdat package
from sas7bdat import SAS7BDAT

# Save file to a DataFrame: df_sas
# with SAS7BDAT('sales.sas7bdat') as file:
#     df_sas = file.to_data_frame()

# Print head of DataFrame
# print(df_sas.head())

import matplotlib.pyplot as plt

# Plot histogram of DataFrame features (pandas and pyplot already imported)
# pd.DataFrame.hist(df_sas[['P']])
# plt.ylabel('count')
# plt.show()


# DO NOT HAVE THIS FILE EITHER
# Import pandas

# Load Stata file into a pandas DataFrame: df
# df = pd.read_stata("disarea.dta")

# Print the head of the DataFrame df
# print(df.head())

# Plot histogram of one column of the DataFrame
# pd.DataFrame.hist(df[['disa10']])
# plt.xlabel('Extent of disease')
# plt.ylabel('Number of countries')
# plt.show()


# DO NOT HAVE THIS FILE EITHER
# Import packages
import numpy as np
import h5py

# Assign filename: file
# file = 'LIGO_data.hdf5'

# Load file: data
# data = h5py.File(file, "r")

# Print the datatype of the loaded file
# print(type(data))

# Print the keys of the file
# for key in data.keys():
#     print(key)


# Get the HDF5 group: group
# group = data["strain"]

# Check out keys of group
# for key in group.keys():
#     print(key)

# Set variable equal to time series data: strain
# strain = data['strain']['Strain'][()]  # [()] reads the whole dataset; the older .value attribute was removed in h5py 3

# Set number of time points to sample: num_samples
# num_samples = 10000

# Set time vector
# time = np.arange(0, 1, 1/num_samples)

# Plot data
# plt.plot(time, strain[:num_samples])
# plt.xlabel('GPS Time (s)')
# plt.ylabel('strain')
# plt.show()


# DO NOT HAVE THIS FILE EITHER - see https://www.mcb.ucdavis.edu/faculty-labs/albeck/workshop.htm
# Import package (cannot get to download)
# import scipy.io

# Load MATLAB file: mat
# mat = scipy.io.loadmat('albeck_gene_expression.mat')

# Print the datatype of mat
# print(type(mat))


# Print the keys of the MATLAB dictionary
# print(mat.keys())

# Print the type of the value corresponding to the key 'CYratioCyt'
# print(type(mat["CYratioCyt"]))

# Print the shape of the value corresponding to the key 'CYratioCyt'
# print(np.shape(mat["CYratioCyt"]))

# Subset the array and plot it
# data = mat['CYratioCyt'][25, 5:]
# fig = plt.figure()
# plt.plot(data)
# plt.xlabel('time (min.)')
# plt.ylabel('normalized fluorescence (measure of expression)')
# plt.show()
## {'Mar': '84.4', 'June': '69.4', 'Airline': '8', 'Aug': '85'}
## <class 'dict'>
## ['bdonly']
##    id  year  bdeadlow  bdeadhig  bdeadbes  annualdata  source  bdversion  \
## 0   1  1946      1000      9999      1000           2       1          3   
## 1   1  1952       450      3000      -999           2       1          3   
## 2   1  1967        25       999        82           2       1          3   
## 3   2  1946        25       999      -999           0       0          3   
## 4   2  1947        25       999      -999           0       0          3   
## 
##    location    sidea   ...    epend  ependdate ependprec  gwnoa gwnoa2nd  \
## 0   Bolivia  Bolivia   ...        1 1946-07-21     -99.0    145      NaN   
## 1   Bolivia  Bolivia   ...        1 1952-04-12     -99.0    145      NaN   
## 2   Bolivia  Bolivia   ...        1 1967-10-16     -99.0    145      NaN   
## 3  Cambodia   France   ...        0        NaT       NaN    220      NaN   
## 4  Cambodia   France   ...        0        NaT       NaN    220      NaN   
## 
##    gwnob  gwnob2nd  gwnoloc region  version  
## 0    NaN       NaN      145      5   2009-4  
## 1    NaN       NaN      145      5   2009-4  
## 2    NaN       NaN      145      5   2009-4  
## 3    NaN       NaN      811      3   2009-4  
## 4    NaN       NaN      811      3   2009-4  
## 
## [5 rows x 32 columns]
##    id  year  bdeadlow  bdeadhig  bdeadbes  annualdata  source  bdversion  \
## 0   1  1946      1000      9999      1000           2       1          3   
## 1   1  1952       450      3000      -999           2       1          3   
## 2   1  1967        25       999        82           2       1          3   
## 3   2  1946        25       999      -999           0       0          3   
## 4   2  1947        25       999      -999           0       0          3   
## 
##    location    sidea   ...    epend  ependdate ependprec  gwnoa gwnoa2nd  \
## 0   Bolivia  Bolivia   ...        1 1946-07-21     -99.0    145      NaN   
## 1   Bolivia  Bolivia   ...        1 1952-04-12     -99.0    145      NaN   
## 2   Bolivia  Bolivia   ...        1 1967-10-16     -99.0    145      NaN   
## 3  Cambodia   France   ...        0        NaT       NaN    220      NaN   
## 4  Cambodia   France   ...        0        NaT       NaN    220      NaN   
## 
##    gwnob  gwnob2nd  gwnoloc region  version  
## 0    NaN       NaN      145      5   2009-4  
## 1    NaN       NaN      145      5   2009-4  
## 2    NaN       NaN      145      5   2009-4  
## 3    NaN       NaN      811      3   2009-4  
## 4    NaN       NaN      811      3   2009-4  
## 
## [5 rows x 32 columns]
##    AAM due to War (2002)  Country
## 0                    450  Bolivia
## 1                     25  Bolivia
## 2                     25   France
## 3                     25   France
## 4                     25   France
##    Country
## 0  Bolivia
## 1  Bolivia
## 2   France
## 3   France
## 4   France

Chapter 3 - Relational databases

Introduction to relational databases - standard discussion of how a relational database (system of tables) works:

  • Each of the tables is analogous to a data frame, keyed by a primary key (unique identifier for the row in question)
  • The tables are linked by way of the primary keys: the primary key of one table appears as a column (a foreign key) in some of the other tables
  • The relational linking process saves a great deal of space, since each fact is stored once and referenced by key rather than repeated in every table
  • Many systems exist, such as PostgreSQL, MySQL, SQLite, and the like
  • SQL is an acronym for “Structured Query Language” which is a standard way for interacting with the relational databases
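
A minimal sketch of the primary-key idea above, using the standard-library sqlite3 module with an in-memory database. The Customers and Orders tables and their contents are hypothetical.

```python
import sqlite3

# In-memory database; table and column names are made up for illustration
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign-key checks off by default

con.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT)")
con.execute(
    "CREATE TABLE Orders ("
    "OrderID INTEGER PRIMARY KEY, "
    "CustomerID INTEGER REFERENCES Customers(CustomerID), "
    "Amount REAL)"
)

# The customer's name is stored once; each order carries only the key
con.execute("INSERT INTO Customers VALUES (1, 'Ada')")
con.executemany("INSERT INTO Orders VALUES (?, ?, ?)", [(10, 1, 25.0), (11, 1, 40.0)])

# A primary key must be unique, so reusing CustomerID 1 is rejected
try:
    con.execute("INSERT INTO Customers VALUES (1, 'Grace')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

n_orders = con.execute("SELECT COUNT(*) FROM Orders WHERE CustomerID = 1").fetchone()[0]
con.close()

print(duplicate_rejected, n_orders)
```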

Creating a database engine in Python - goal is to get data out of the relational database using SQL:

  • SQLite is nice since it is fast and simple, though other databases may have additional valuable features
  • The package “SQLAlchemy” works with many other RDBMS (relational database management systems)
    • from sqlalchemy import create_engine
    • engine = create_engine(“sqlite:///mySQLDatabase.sqlite”) # the connection string names the database type and then the file; the prefix and extension will differ for a different type of database
    • engine.table_names() # provides the names of all the tables in engine

Querying relational databases in Python - connecting to the engine and then querying (getting data out from) the database:

  • SELECT * FROM myTable will bring over all columns of all rows
  • General workflow for SQL in Python includes: 1) import packages, 2) create the DB engine, 3) connect to the engine, 4) query the database, 5) save query results to a DataFrame, and 6) close the connection
    • Step 3: con = engine.connect()
    • Step 4: rs = con.execute(“valid SQL queries”)
    • Step 5: df = pd.DataFrame(rs.fetchall()) ; df.columns = rs.keys() # if wanting to bring over meaningful column names
    • Step 6: con.close()
  • A context manager (with engine.connect() as con:) avoids the hassle of calling con.close() or, worse, forgetting to close the connection
  • Note that rs.fetchmany(size=5) is an option for bringing over just 5 lines from the query (can use numbers other than 5 also)
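
The connect, query, fetch, and close steps above can be sketched without a database file. This sketch swaps in the standard-library sqlite3 module for SQLAlchemy (an assumption made so that it runs anywhere); execute, fetchmany, and fetchall play the same roles as in the notes, and the table myTable is hypothetical.

```python
import sqlite3
from contextlib import closing

# closing() guarantees con.close() runs when the block exits, even on error
with closing(sqlite3.connect(":memory:")) as con:
    # Build a small hypothetical table so there is something to query
    con.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY, label TEXT)")
    con.executemany(
        "INSERT INTO myTable (label) VALUES (?)",
        [("row" + str(i),) for i in range(1, 9)],
    )

    # Step 4: run the query; rs is a cursor over the result set
    rs = con.execute("SELECT * FROM myTable")

    # Column names, analogous to rs.keys() in the notes
    columns = [col[0] for col in rs.description]

    # Pull only the first 5 rows, then whatever remains
    first_five = rs.fetchmany(size=5)
    remaining = rs.fetchall()

print(columns)
print(len(first_five), len(remaining))
```

Fetching in batches keeps memory use bounded when a query returns far more rows than are needed at once.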

Querying relational databases directly with pandas - shortcut to the above process:

  • df = pd.read_sql_query(“valid SQL code”, engine) # where import pandas as pd and engine = create_engine(“mySQLConnection”) have previously been run
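
A minimal sketch of the pandas shortcut. It passes a standard-library sqlite3 connection to pd.read_sql_query rather than a SQLAlchemy engine (an assumption made so that no database file is needed), and the Album rows are made up.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for a real database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT)")
con.executemany(
    "INSERT INTO Album (Title) VALUES (?)",
    [("Big Ones",), ("Let There Be Rock",)],
)
con.commit()

# One call runs the query and returns a DataFrame with named columns
df = pd.read_sql_query("SELECT * FROM Album", con)
con.close()

print(df.head())
```

The column names come across automatically, so the separate df.columns = rs.keys() step is not needed.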

Advanced querying - exploiting table relationships (combining multiple tables):

  • The SQL JOIN brings 2+ tables together by matching rows on a shared key
  • SELECT myVars FROM Table1 INNER JOIN Table2 ON joinCriteria
    • Note that the format for variables is Table.Variable, so Orders.CustomerID = Customers.CustomerID
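
A small worked example of the INNER JOIN pattern above, again using the standard-library sqlite3 module with an in-memory database. The Orders and Customers tables and their rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT)")
con.execute("CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, Amount REAL)")
con.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace"), (3, "Edsger")],
)
con.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 2, 40.0), (12, 1, 15.0)],
)

# Table.Variable notation disambiguates the CustomerID column shared by both tables
query = """
SELECT Orders.OrderID, Customers.Name, Orders.Amount
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID
ORDER BY Orders.OrderID
"""
rows = con.execute(query).fetchall()
con.close()

for row in rows:
    print(row)
```

An inner join keeps only rows with a match in both tables, so the customer with no orders does not appear in the result.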

Example code includes:


myPath = "./PythonInputFiles/"

# NEED FILE - may be able to get at http://chinookdatabase.codeplex.com/
# Downloaded the ZIP, extracted the SQLite, and renamed to Chinook.sqlite
# Import necessary module
from sqlalchemy import create_engine

# Create engine: engine
engine = create_engine('sqlite:///' + myPath + 'Chinook.sqlite')  # 'sqlite:///' followed by the file path forms the 'connection string'


# Save the table names to a list: table_names
table_names = engine.table_names()

# Print the table names to the shell
print(table_names)


# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///' + myPath + 'Chinook.sqlite')

# Open engine connection: con
con = engine.connect()

# Perform query: rs
rs = con.execute("SELECT * FROM Album")

# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())

# Close connection
con.close()

# Print head of DataFrame df
print(df.head())


# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT LastName, Title FROM Employee")
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()

# Print the length of the DataFrame df
print(len(df))

# Print the head of the DataFrame df
print(df.head())


# Create engine: engine
engine = create_engine('sqlite:///' + myPath + 'Chinook.sqlite')

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee WHERE EmployeeID >= 6")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print the head of the DataFrame df
print(df.head())


# Create engine: engine
engine = create_engine('sqlite:///' + myPath + 'Chinook.sqlite')

# Open engine in context manager
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee ORDER BY BirthDate")
    df = pd.DataFrame(rs.fetchall())
    
    # Set the DataFrame's column names
    df.columns = rs.keys()

# Print head of DataFrame
print(df.head())


# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///' + myPath + 'Chinook.sqlite')

# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Album", engine)

# Print head of DataFrame
print(df.head())

# Open engine in context manager
# Perform query and save results to DataFrame: df1
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Album")
    df1 = pd.DataFrame(rs.fetchall())
    df1.columns = rs.keys()

# Confirm that both methods yield the same result: does df = df1 ?
print(df.equals(df1))


# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///' + myPath + 'Chinook.sqlite')

# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Employee WHERE EmployeeId >= 6 ORDER BY BirthDate", engine)

# Print head of DataFrame
print(df.head())


# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT Title, Name FROM Album INNER JOIN Artist ON Album.ArtistID = Artist.ArtistID")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print head of DataFrame df
print(df.head())


# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN Track ON PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)

# Print head of DataFrame
print(df.head())
## ['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']
##    0                                      1  2
## 0  1  For Those About To Rock We Salute You  1
## 1  2                      Balls to the Wall  2
## 2  3                      Restless and Wild  2
## 3  4                      Let There Be Rock  1
## 4  5                               Big Ones  3
## 3
##   LastName                Title
## 0    Adams      General Manager
## 1  Edwards        Sales Manager
## 2  Peacock  Sales Support Agent
##    EmployeeId  LastName FirstName       Title  ReportsTo            BirthDate  \
## 0           6  Mitchell   Michael  IT Manager          1  1973-07-01 00:00:00   
## 1           7      King    Robert    IT Staff          6  1970-05-29 00:00:00   
## 2           8  Callahan     Laura    IT Staff          6  1968-01-09 00:00:00   
## 
##               HireDate                      Address        City State Country  \
## 0  2003-10-17 00:00:00         5827 Bowness Road NW     Calgary    AB  Canada   
## 1  2004-01-02 00:00:00  590 Columbia Boulevard West  Lethbridge    AB  Canada   
## 2  2004-03-04 00:00:00                  923 7 ST NW  Lethbridge    AB  Canada   
## 
##   PostalCode              Phone                Fax                    Email  
## 0    T3B 0C5  +1 (403) 246-9887  +1 (403) 246-9899  michael@chinookcorp.com  
## 1    T1K 5N8  +1 (403) 456-9986  +1 (403) 456-8485   robert@chinookcorp.com  
## 2    T1H 1Y8  +1 (403) 467-3351  +1 (403) 467-8772    laura@chinookcorp.com  
##    EmployeeId  LastName FirstName                Title  ReportsTo  \
## 0           4      Park  Margaret  Sales Support Agent        2.0   
## 1           2   Edwards     Nancy        Sales Manager        1.0   
## 2           1     Adams    Andrew      General Manager        NaN   
## 3           5   Johnson     Steve  Sales Support Agent        2.0   
## 4           8  Callahan     Laura             IT Staff        6.0   
## 
##              BirthDate             HireDate              Address        City  \
## 0  1947-09-19 00:00:00  2003-05-03 00:00:00     683 10 Street SW     Calgary   
## 1  1958-12-08 00:00:00  2002-05-01 00:00:00         825 8 Ave SW     Calgary   
## 2  1962-02-18 00:00:00  2002-08-14 00:00:00  11120 Jasper Ave NW    Edmonton   
## 3  1965-03-03 00:00:00  2003-10-17 00:00:00         7727B 41 Ave     Calgary   
## 4  1968-01-09 00:00:00  2004-03-04 00:00:00          923 7 ST NW  Lethbridge   
## 
##   State Country PostalCode              Phone                Fax  \
## 0    AB  Canada    T2P 5G3  +1 (403) 263-4423  +1 (403) 263-4289   
## 1    AB  Canada    T2P 2T3  +1 (403) 262-3443  +1 (403) 262-3322   
## 2    AB  Canada    T5K 2N1  +1 (780) 428-9482  +1 (780) 428-3457   
## 3    AB  Canada    T3B 1Y7   1 (780) 836-9987   1 (780) 836-9543   
## 4    AB  Canada    T1H 1Y8  +1 (403) 467-3351  +1 (403) 467-8772   
## 
##                       Email  
## 0  margaret@chinookcorp.com  
## 1     nancy@chinookcorp.com  
## 2    andrew@chinookcorp.com  
## 3     steve@chinookcorp.com  
## 4     laura@chinookcorp.com  
##    AlbumId                                  Title  ArtistId
## 0        1  For Those About To Rock We Salute You         1
## 1        2                      Balls to the Wall         2
## 2        3                      Restless and Wild         2
## 3        4                      Let There Be Rock         1
## 4        5                               Big Ones         3
## True
##    EmployeeId  LastName FirstName       Title  ReportsTo            BirthDate  \
## 0           8  Callahan     Laura    IT Staff          6  1968-01-09 00:00:00   
## 1           7      King    Robert    IT Staff          6  1970-05-29 00:00:00   
## 2           6  Mitchell   Michael  IT Manager          1  1973-07-01 00:00:00   
## 
##               HireDate                      Address        City State Country  \
## 0  2004-03-04 00:00:00                  923 7 ST NW  Lethbridge    AB  Canada   
## 1  2004-01-02 00:00:00  590 Columbia Boulevard West  Lethbridge    AB  Canada   
## 2  2003-10-17 00:00:00         5827 Bowness Road NW     Calgary    AB  Canada   
## 
##   PostalCode              Phone                Fax                    Email  
## 0    T1H 1Y8  +1 (403) 467-3351  +1 (403) 467-8772    laura@chinookcorp.com  
## 1    T1K 5N8  +1 (403) 456-9986  +1 (403) 456-8485   robert@chinookcorp.com  
## 2    T3B 0C5  +1 (403) 246-9887  +1 (403) 246-9899  michael@chinookcorp.com  
##                                    Title       Name
## 0  For Those About To Rock We Salute You      AC/DC
## 1                      Balls to the Wall     Accept
## 2                      Restless and Wild     Accept
## 3                      Let There Be Rock      AC/DC
## 4                               Big Ones  Aerosmith
##    PlaylistId  TrackId  TrackId              Name  AlbumId  MediaTypeId  \
## 0           1     3390     3390  One and the Same      271            2   
## 1           1     3392     3392     Until We Fall      271            2   
## 2           1     3393     3393     Original Fire      271            2   
## 3           1     3394     3394       Broken City      271            2   
## 4           1     3395     3395          Somedays      271            2   
## 
##    GenreId Composer  Milliseconds    Bytes  UnitPrice  
## 0       23     None        217732  3559040       0.99  
## 1       23     None        230758  3766605       0.99  
## 2       23     None        218916  3577821       0.99  
## 3       23     None        228366  3728955       0.99  
## 4       23     None        213831  3497176       0.99

Importing Data in Python (Part II)

Chapter 1 - Importing Data from the Internet

Importing flat files from the web - non-local files:

  • Clicking on URL and downloading files creates reproducibility problems and is non-scalable
  • Course covers 1) import and locally save from the web, 2) load datasets into pandas DataFrames, 3) make HTTP requests, 4) scrape HTML (BeautifulSoup)
  • This course will particularly focus on “urllib” and “requests” packages
  • The “urllib” package has an interface for fetching data from across the web
    • urllib.request.urlopen(“myURL”) # Very similar to open() but takes a URL rather than a local file name (in Python 3, urlopen lives in urllib.request)
    • from urllib.request import urlretrieve ; url = “myQuotedURL” ; urlretrieve(url, “myLocalFileName”)
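
The download-then-load pattern above can be tried without a network connection, since urlretrieve also accepts file:// URLs. The sketch below is a minimal illustration under that assumption: the file names (source.csv, local_copy.csv) and the two data rows are made up for the example, and the standard-library csv module stands in for pd.read_csv(..., sep=';').

```python
import csv
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical stand-in for a remote file: a small semicolon-separated CSV
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.csv"
source.write_text("fixed acidity;quality\n7.4;5\n7.8;5\n")

# urlretrieve(url, filename) copies the resource at url to a local file
url = source.as_uri()  # a file:// URL pointing at source.csv
local_name, headers = urlretrieve(url, str(workdir / "local_copy.csv"))

# Read the local copy; delimiter=';' plays the role of sep=';' in pandas
with open(local_name, newline="") as f:
    rows = list(csv.DictReader(f, delimiter=";"))

print(rows[0])
```

With a real web address, only the url assignment would change; the save and read steps stay the same.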

HTTP requests to import files from the web - unpacking the urlretrieve from urllib.request:

  • URL is an acronym for Uniform/Universal Resource Locator (reference to web resources such as web addresses, FTP, and the like)
  • Ingredients for a URL include 1) a protocol identifier (such as “http:”) and 2) a resource name (such as “datacamp.com”)
  • HTTP is an acronym for Hyper-Text Transfer Protocol which is the foundation for data communication on the web
    • Going to a website is the process of sending a GET request through HTTP ; the urlretrieve does this automatically
  • HTML is an acronym for HyperText Markup Language, which is the standard mark-up language used on the internet
  • Example process for GET requests using urllib
    • from urllib.request import urlopen, Request
    • url = “https://www.wikipedia.org/” ; request = Request(url) ; response = urlopen(request) ; html = response.read() ; response.close()
  • Can also send GET requests using “requests”, a commonly used package that simplifies the process
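
The two URL ingredients and the default GET method can be inspected without contacting any server. This is a small sketch using only the standard library; the Wikipedia address is the one from the notes above, and nothing is sent because urlopen is never called.

```python
from urllib.parse import urlsplit
from urllib.request import Request

url = "https://www.wikipedia.org/"

# Split the URL into its components
parts = urlsplit(url)
print(parts.scheme)   # protocol identifier
print(parts.netloc)   # resource name (host)

# Packaging a request does not send anything yet; urlopen(request) would
request = Request(url)
print(request.get_method())  # method used when no data is attached
```

Attaching a data argument to Request would switch the method to POST, which is why plain page fetches end up as GET requests.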

Scraping the web in Python using BeautifulSoup - make sense of the jumbled, unstructured HTML data:

  • Structured data has either 1) a pre-defined data model, or 2) organization in a defined manner
  • HTML is unstructured data, possessing neither of these properties
  • BeautifulSoup parses and extracts structured data from HTML
  • General usage would include
    • from bs4 import BeautifulSoup ; import requests
    • url = “https://www.crummy.com/software/BeautifulSoup/”
    • r = requests.get(url) ; html_doc = r.text
    • soup = BeautifulSoup(html_doc)
    • print(soup.prettify()) # prints properly indented HTML code, which is easier for a human to read
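
To make concrete what "extracting structured data from HTML" involves, the sketch below pulls the title and hyperlinks out of a small HTML string. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; the LinkCollector class and the sample HTML are hypothetical and only approximate what soup.title and soup.find_all("a") provide.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title>...</title> is kept
        if self._in_title:
            self.title += data


# Made-up page used only for this illustration
html_doc = """
<html><head><title>Demo Page</title></head>
<body><p>See <a href="pics.html">pictures</a> and
<a href="http://www.python.org">Python</a>.</p></body></html>
"""

collector = LinkCollector()
collector.feed(html_doc)
print(collector.title)
print(collector.links)
```

BeautifulSoup wraps this kind of event-driven parsing in a searchable tree, which is why the same task takes only a line or two there.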

Example code includes:


# Import package
from urllib.request import urlretrieve
import pandas as pd

# Assign url of file: url (ran once - no need to re-run)
# url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally
# urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())


# Import packages
import matplotlib.pyplot as plt
import pandas as pd

# Assign url of file: url (ran once - no need to re-run)
# url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Read file into a DataFrame: df
# df = pd.read_csv(url, sep=";")

# Print the head of the DataFrame
# print(df.head())

# Plot first column of df (.iloc replaces the .ix indexer, which newer pandas versions no longer provide)
pd.DataFrame.hist(df.iloc[:, 0:1])
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
# plt.show()
plt.savefig("_dummyPy044.png", bbox_inches="tight")
plt.clf()


# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xl (newer pandas versions spell the argument sheet_name)
xl = pd.read_excel(url, sheet_name=None)

# Print the sheetnames to the shell
print(xl.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xl["1700"].head())


# Import packages
from urllib.request import urlopen, Request

# Specify the url
url = "http://www.datacamp.com/teach/documentation"

# This packages the request: request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Print the datatype of response
print(type(response))

# Be polite and close the response!
response.close()


# Specify the url
url = "http://docs.datacamp.com/teach/"

# This packages the request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Extract the response: html
html = response.read()

# Print the html
print(html)

# Be polite and close the response!
response.close()


import requests

# Specify the url: url
url = "http://docs.datacamp.com/teach/"

# Packages the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response: text
text = r.text

# Print the html
print(text)


# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)


# Get the title of Guido's webpage: guido_title
guido_title = soup.title

# Print the title of Guido's webpage to the shell
print(guido_title)

# Get Guido's text: guido_text
guido_text = soup.get_text()

# Print Guido's text to the shell
print(guido_text)


# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all("a")

# Print the URLs to the shell
for link in a_tags:
    print(link.get("href"))
##    fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
## 0            7.4              0.70         0.00             1.9      0.076   
## 1            7.8              0.88         0.00             2.6      0.098   
## 2            7.8              0.76         0.04             2.3      0.092   
## 3           11.2              0.28         0.56             1.9      0.075   
## 4            7.4              0.70         0.00             1.9      0.076   
## 
##    free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
## 0                 11.0                  34.0   0.9978  3.51       0.56   
## 1                 25.0                  67.0   0.9968  3.20       0.68   
## 2                 15.0                  54.0   0.9970  3.26       0.65   
## 3                 17.0                  60.0   0.9980  3.16       0.58   
## 4                 11.0                  34.0   0.9978  3.51       0.56   
## 
##    alcohol  quality  
## 0      9.4        5  
## 1      9.8        5  
## 2      9.8        5  
## 3      9.8        6  
## 4      9.4        5  
## odict_keys(['1700', '1900'])
##                  country       1700
## 0            Afghanistan  34.565000
## 1  Akrotiri and Dhekelia  34.616667
## 2                Albania  41.312000
## 3                Algeria  36.720000
## 4         American Samoa -14.307000
## <class 'http.client.HTTPResponse'>
## b'<!DOCTYPE html>\n<link rel="shortcut icon" href="images/favicon.ico" />\n<html>\n\n  <head>\n  <meta charset="utf-8">\n  <meta http-equiv="X-UA-Compatible" content="IE=edge">\n  <meta name="viewport" content="width=device-width, initial-scale=1">\n\n  <title>Home</title>\n  <meta name="description" content="All Documentation on Course Creation">\n\n  <link rel="stylesheet" href="/teach/css/main.css">\n  <link rel="canonical" href="/teach/">\n  <link rel="alternate" type="application/rss+xml" title="DataCamp Teach Documentation" href="/teach/feed.xml" />\n</head>\n\n\n  <body>\n\n    <header class="site-header">\n\n  <div class="wrapper">\n\n    <a class="site-title" href="/teach/">DataCamp Teach Documentation</a>\n\n  </div>\n\n</header>\n\n\n    <div class="page-content">\n      <div class="wrapper">\n        <p>The Teach Documentation has been moved to <a href="https://www.datacamp.com/teach/documentation">https://www.datacamp.com/teach/documentation</a>!</p>\n\n<!-- Everybody can teach on DataCamp. The resources on this website explain all the steps to build your own course on DataCamp\'s interactive data science platform.\n\nInterested in partnering with DataCamp? 
Head over to the [Course Material](/teach/course-material.html) page to get an idea of the requirements to build your own interactive course together with DataCamp!\n\n## Table of Contents\n\n- [Course Material](/teach/course-material.html) - Content required to build a DataCamp course.\n- [Video Lectures](/teach/video-lectures.html) - Details on video recording and editing.\n- [DataCamp Teach](https://www.datacamp.com/teach) - Use the DataCamp Teach website to create DataCamp courses (preferred).\n- [datacamp R Package](https://github.com/datacamp/datacamp/wiki) - Use R Package to create DataCamp courses (legacy).\n- [Code DataCamp Exercises](/teach/code-datacamp-exercises.html)\n- [SCT Design (R)](https://github.com/datacamp/testwhat/wiki)\n- [SCT Design (Python)](https://github.com/datacamp/pythonwhat/wiki)\n- [Style Guide](/teach/style-guide.html) -->\n\n\n      </div>\n    </div>\n\n    \n\n  </body>\n\n</html>\n'
## <!DOCTYPE html>
## <link rel="shortcut icon" href="images/favicon.ico" />
## <html>
## 
##   <head>
##   <meta charset="utf-8">
##   <meta http-equiv="X-UA-Compatible" content="IE=edge">
##   <meta name="viewport" content="width=device-width, initial-scale=1">
## 
##   <title>Home</title>
##   <meta name="description" content="All Documentation on Course Creation">
## 
##   <link rel="stylesheet" href="/teach/css/main.css">
##   <link rel="canonical" href="/teach/">
##   <link rel="alternate" type="application/rss+xml" title="DataCamp Teach Documentation" href="/teach/feed.xml" />
## </head>
## 
## 
##   <body>
## 
##     <header class="site-header">
## 
##   <div class="wrapper">
## 
##     <a class="site-title" href="/teach/">DataCamp Teach Documentation</a>
## 
##   </div>
## 
## </header>
## 
## 
##     <div class="page-content">
##       <div class="wrapper">
##         <p>The Teach Documentation has been moved to <a href="https://www.datacamp.com/teach/documentation">https://www.datacamp.com/teach/documentation</a>!</p>
## 
## <!-- Everybody can teach on DataCamp. The resources on this website explain all the steps to build your own course on DataCamp's interactive data science platform.
## 
## Interested in partnering with DataCamp? Head over to the [Course Material](/teach/course-material.html) page to get an idea of the requirements to build your own interactive course together with DataCamp!
## 
## ## Table of Contents
## 
## - [Course Material](/teach/course-material.html) - Content required to build a DataCamp course.
## - [Video Lectures](/teach/video-lectures.html) - Details on video recording and editing.
## - [DataCamp Teach](https://www.datacamp.com/teach) - Use the DataCamp Teach website to create DataCamp courses (preferred).
## - [datacamp R Package](https://github.com/datacamp/datacamp/wiki) - Use R Package to create DataCamp courses (legacy).
## - [Code DataCamp Exercises](/teach/code-datacamp-exercises.html)
## - [SCT Design (R)](https://github.com/datacamp/testwhat/wiki)
## - [SCT Design (Python)](https://github.com/datacamp/pythonwhat/wiki)
## - [Style Guide](/teach/style-guide.html) -->
## 
## 
##       </div>
##     </div>
## 
##     
## 
##   </body>
## 
## </html>
## 
## <html>
##  <head>
##   <title>
##    Guido's Personal Home Page
##   </title>
##  </head>
##  <body bgcolor="#FFFFFF" text="#000000">
##   <h1>
##    <a href="pics.html">
##     <img border="0" src="images/IMG_2192.jpg"/>
##    </a>
##    Guido van Rossum - Personal Home Page
##   </h1>
##   <p>
##    <a href="http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm">
##     <i>
##      "Gawky and proud of it."
##     </i>
##    </a>
##    <h3>
##     <a href="http://metalab.unc.edu/Dave/Dr-Fun/df200004/df20000406.jpg">
##      Who
## I Am
##     </a>
##    </h3>
##    <p>
##     Read
## my
##     <a href="http://neopythonic.blogspot.com/2016/04/kings-day-speech.html">
##      "King's
## Day Speech"
##     </a>
##     for some inspiration.
##     <p>
##      I am the author of the
##      <a href="http://www.python.org">
##       Python
##      </a>
##      programming language.  See also my
##      <a href="Resume.html">
##       resume
##      </a>
##      and my
##      <a href="Publications.html">
##       publications list
##      </a>
##      , a
##      <a href="bio.html">
##       brief bio
##      </a>
##      , assorted
##      <a href="http://legacy.python.org/doc/essays/">
##       writings
##      </a>
##      ,
##      <a href="http://legacy.python.org/doc/essays/ppt/">
##       presentations
##      </a>
##      and
##      <a href="interviews.html">
##       interviews
##      </a>
##      (all about Python), some
##      <a href="pics.html">
##       pictures of me
##      </a>
##      ,
##      <a href="http://neopythonic.blogspot.com">
##       my new blog
##      </a>
##      , and
## my
##      <a href="http://www.artima.com/weblogs/index.jsp?blogger=12088">
##       old
## blog
##      </a>
##      on Artima.com.  I am
##      <a href="https://twitter.com/gvanrossum">
##       @gvanrossum
##      </a>
##      on Twitter.  I
## also have
## a
##      <a href="https://plus.google.com/u/0/115212051037621986145/posts">
##       G+
## profile
##      </a>
##      .
##      <p>
##       In January 2013 I joined
##       <a href="http://www.dropbox.com">
##        Dropbox
##       </a>
##       .  I work on various Dropbox
## products and have 50% for my Python work, no strings attached.
## Previously, I have worked for Google, Elemental Security, Zope
## Corporation, BeOpen.com, CNRI, CWI, and SARA.  (See
## my
##       <a href="Resume.html">
##        resume
##       </a>
##       .)  I created Python while at CWI.
##       <h3>
##        How to Reach Me
##       </h3>
##       <p>
##        You can send email for me to guido (at) python.org.
## I read everything sent there, but if you ask
## me a question about using Python, it's likely that I won't have time
## to answer it, and will instead refer you to
## help (at) python.org,
##        <a href="http://groups.google.com/groups?q=comp.lang.python">
##         comp.lang.python
##        </a>
##        or
##        <a href="http://stackoverflow.com">
##         StackOverflow
##        </a>
##        .  If you need to
## talk to me on the phone or send me something by snail mail, send me an
## email and I'll gladly email you instructions on how to reach me.
##        <h3>
##         My Name
##        </h3>
##        <p>
##         My name often poses difficulties for Americans.
##         <p>
##          <b>
##           Pronunciation:
##          </b>
##          in Dutch, the "G" in Guido is a hard G,
## pronounced roughly like the "ch" in Scottish "loch".  (Listen to the
##          <a href="guido.au">
##           sound clip
##          </a>
##          .)  However, if you're
## American, you may also pronounce it as the Italian "Guido".  I'm not
## too worried about the associations with mob assassins that some people
## have. :-)
##          <p>
##           <b>
##            Spelling:
##           </b>
##           my last name is two words, and I'd like to keep it
## that way, the spelling on some of my credit cards notwithstanding.
## Dutch spelling rules dictate that when used in combination with my
## first name, "van" is not capitalized: "Guido van Rossum".  But when my
## last name is used alone to refer to me, it is capitalized, for
## example: "As usual, Van Rossum was right."
##           <p>
##            <b>
##             Alphabetization:
##            </b>
##            in America, I show up in the alphabet under
## "V".  But in Europe, I show up under "R".  And some of my friends put
## me under "G" in their address book...
##            <h3>
##             More Hyperlinks
##            </h3>
##            <ul>
##             <li>
##              Here's a collection of
##              <a href="http://legacy.python.org/doc/essays/">
##               essays
##              </a>
##              relating to Python
## that I've written, including the foreword I wrote for Mark Lutz' book
## "Programming Python".
##              <p>
##               <li>
##                I own the official
##                <a href="images/license.jpg">
##                 <img align="center" border="0" height="75" src="images/license_thumb.jpg" width="100"/>
##                 Python license.
##                </a>
##                <p>
##                </p>
##               </li>
##              </p>
##             </li>
##            </ul>
##            <h3>
##             The Audio File Formats FAQ
##            </h3>
##            <p>
##             I was the original creator and maintainer of the Audio File Formats
## FAQ.  It is now maintained by Chris Bagwell
## at
##             <a href="http://www.cnpbagwell.com/audio-faq">
##              http://www.cnpbagwell.com/audio-faq
##             </a>
##             .  And here is a link to
##             <a href="http://sox.sourceforge.net/">
##              SOX
##             </a>
##             , to which I contributed
## some early code.
##            </p>
##           </p>
##          </p>
##         </p>
##        </p>
##       </p>
##      </p>
##     </p>
##    </p>
##   </p>
##  </body>
## </html>
## <hr/>
## <a href="images/internetdog.gif">
##  "On the Internet, nobody knows you're
## a dog."
## </a>
## <hr/>
## C:\Users\Dave\AppData\Local\Programs\Python\PYTHON~1\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
## 
## The code that caused this warning is on line 119 of the file <string>. To get rid of this warning, change code that looks like this:
## 
##  BeautifulSoup(YOUR_MARKUP})
## 
## to this:
## 
##  BeautifulSoup(YOUR_MARKUP, "html.parser")
## 
##   markup_type=markup_type))
## 
## <title>Guido's Personal Home Page</title>
## 
## 
## Guido's Personal Home Page
## 
## 
## 
## 
## Guido van Rossum - Personal Home Page
## "Gawky and proud of it."
## Who
## I Am
## Read
## my "King's
## Day Speech" for some inspiration.
## 
## I am the author of the Python
## programming language.  See also my resume
## and my publications list, a brief bio, assorted writings, presentations and interviews (all about Python), some
## pictures of me,
## my new blog, and
## my old
## blog on Artima.com.  I am
## @gvanrossum on Twitter.  I
## also have
## a G+
## profile.
## 
## In January 2013 I joined
## Dropbox.  I work on various Dropbox
## products and have 50% for my Python work, no strings attached.
## Previously, I have worked for Google, Elemental Security, Zope
## Corporation, BeOpen.com, CNRI, CWI, and SARA.  (See
## my resume.)  I created Python while at CWI.
## 
## How to Reach Me
## You can send email for me to guido (at) python.org.
## I read everything sent there, but if you ask
## me a question about using Python, it's likely that I won't have time
## to answer it, and will instead refer you to
## help (at) python.org,
## comp.lang.python or
## StackOverflow.  If you need to
## talk to me on the phone or send me something by snail mail, send me an
## email and I'll gladly email you instructions on how to reach me.
## 
## My Name
## My name often poses difficulties for Americans.
## 
## Pronunciation: in Dutch, the "G" in Guido is a hard G,
## pronounced roughly like the "ch" in Scottish "loch".  (Listen to the
## sound clip.)  However, if you're
## American, you may also pronounce it as the Italian "Guido".  I'm not
## too worried about the associations with mob assassins that some people
## have. :-)
## 
## Spelling: my last name is two words, and I'd like to keep it
## that way, the spelling on some of my credit cards notwithstanding.
## Dutch spelling rules dictate that when used in combination with my
## first name, "van" is not capitalized: "Guido van Rossum".  But when my
## last name is used alone to refer to me, it is capitalized, for
## example: "As usual, Van Rossum was right."
## 
## Alphabetization: in America, I show up in the alphabet under
## "V".  But in Europe, I show up under "R".  And some of my friends put
## me under "G" in their address book...
## 
## 
## More Hyperlinks
## 
## Here's a collection of essays relating to Python
## that I've written, including the foreword I wrote for Mark Lutz' book
## "Programming Python".
## I own the official 
## Python license.
## 
## The Audio File Formats FAQ
## I was the original creator and maintainer of the Audio File Formats
## FAQ.  It is now maintained by Chris Bagwell
## at http://www.cnpbagwell.com/audio-faq.  And here is a link to
## SOX, to which I contributed
## some early code.
## 
## 
## 
## "On the Internet, nobody knows you're
## a dog."
## 
## 
## 
## pics.html
## http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm
## http://metalab.unc.edu/Dave/Dr-Fun/df200004/df20000406.jpg
## http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
## http://www.python.org
## Resume.html
## Publications.html
## bio.html
## http://legacy.python.org/doc/essays/
## http://legacy.python.org/doc/essays/ppt/
## interviews.html
## pics.html
## http://neopythonic.blogspot.com
## http://www.artima.com/weblogs/index.jsp?blogger=12088
## https://twitter.com/gvanrossum
## https://plus.google.com/u/0/115212051037621986145/posts
## http://www.dropbox.com
## Resume.html
## http://groups.google.com/groups?q=comp.lang.python
## http://stackoverflow.com
## guido.au
## http://legacy.python.org/doc/essays/
## images/license.jpg
## http://www.cnpbagwell.com/audio-faq
## http://sox.sourceforge.net/
## images/internetdog.gif

Acidity of Red Wine:


Chapter 2 - Interacting with APIs

Introduction to APIs (Application Programming Interface) and JSON (JavaScript Object Notation):

  • API is a protocol and routine for building and interacting with software applications
  • JSON, developed by Douglas Crockford, helps with real-time browser-to-server communication
  • JSON has name-value pairs, very similar to a Python dictionary
  • General process might include
    • import json
    • with open(“snakes.json”, “r”) as json_file: json_data = json.load(json_file) # json_data will be imported as a dictionary
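
The similarity between JSON name-value pairs and a Python dictionary can be seen by decoding a JSON string directly. The sketch below assumes a made-up JSON document (not the contents of snakes.json, which are not shown in these notes) and uses json.loads, the string counterpart of json.load.

```python
import json

# Hypothetical JSON text with name-value pairs
json_text = '{"animal": "snake", "legs": 0, "venomous": true, "habitats": ["desert", "forest"]}'

# json.loads decodes a string; json.load does the same for an open file
json_data = json.loads(json_text)
print(type(json_data))

# JSON objects become dicts, arrays become lists, true/false become True/False
for key, value in json_data.items():
    print(key + ": ", value)

# Encoding back to JSON and decoding again gives an equal dictionary
round_trip = json.loads(json.dumps(json_data))
```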

APIs and interacting with the world-wide web - what APIs are and why they are important:

  • The API is a set of protocols and routines for interacting with software programs
  • The “Open Movies Database” (OMDB) has an API, as do most websites that might be data sources
  • Example usage might include
    • import requests
    • url = “http://www.omdbapi.com/?t=hackers” # the ? begins a “query string”, in this case asking for “t” (title) equal to “hackers” (the movie “Hackers”)
    • r = requests.get(url)
    • json_data = r.json()
  • Consult the OMDB API webpage to see how they allow their data to be queried/used and how to format the relevant “query strings”
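
Query strings can be assembled from a dictionary instead of being typed by hand, which avoids mistakes with spaces and special characters. This is a sketch using the standard library only; the apikey value YOUR_KEY is a placeholder, and the parameter names are the ones that appear in the OMDB URLs in these notes.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "http://www.omdbapi.com/"
params = {"apikey": "YOUR_KEY", "t": "social network"}  # placeholder key

# urlencode joins name=value pairs with & and encodes spaces as +
query_string = urlencode(params)
url = base_url + "?" + query_string
print(url)

# Going the other way: recover the parameters from a full URL
parsed = parse_qs(urlsplit(url).query)
print(parsed)
```

The requests package can do the same encoding itself when the dictionary is passed as requests.get(base_url, params=params).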

Example code includes:


myPath = "./PythonInputFiles/"

# DO NOT HAVE FILE a_movie.json, which appears to be JSON for the movie Social Network (2010)
# Created and saved file
import json

# Load JSON: json_data
with open(myPath + "a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])


# PROBABLY DO NOT RUN; NEED API KEY
# Import requests package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=ff21610b&t=social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Print the text of the response
print(r.text)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])


# Assign URL to variable: url
url = "https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza"

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print the Wikipedia page extract
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)
## imdbRating:  7.7
## Rated:  PG-13
## Year:  2010
## DVD:  N/A
## Ratings:  [{'Value': '7.7/10', 'Source': 'Internet Movie Database'}, {'Value': '96%', 'Source': 'Rotten Tomatoes'}, {'Value': '95/100', 'Source': 'Metacritic'}]
## Metascore:  95
## Runtime:  120 min
## Released:  01 Oct 2010
## Plot:  Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
## Poster:  https://images-na.ssl-images-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg
## imdbVotes:  508,540
## Director:  David Fincher
## Website:  http://www.thesocialnetwork-movie.com/
## Writer:  Aaron Sorkin (screenplay), Ben Mezrich (book)
## Awards:  Won 3 Oscars. Another 162 wins & 162 nominations.
## Language:  English, French
## Genre:  Biography, Drama
## Type:  movie
## Country:  USA
## Production:  Columbia Pictures
## imdbID:  tt1285016
## Actors:  Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
## BoxOffice:  $96,400,000
## Response:  True
## Title:  The Social Network
## {"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Director":"David Fincher","Writer":"Aaron Sorkin (screenplay), Ben Mezrich (book)","Actors":"Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons","Plot":"Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.","Language":"English, French","Country":"USA","Awards":"Won 3 Oscars. Another 165 wins & 169 nominations.","Poster":"https://images-na.ssl-images-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"7.7/10"},{"Source":"Rotten Tomatoes","Value":"96%"},{"Source":"Metacritic","Value":"95/100"}],"Metascore":"95","imdbRating":"7.7","imdbVotes":"514,092","imdbID":"tt1285016","Type":"movie","DVD":"11 Jan 2011","BoxOffice":"$96,400,000","Production":"Columbia Pictures","Website":"http://www.thesocialnetwork-movie.com/","Response":"True"}
## Title:  The Social Network
## Year:  2010
## Rated:  PG-13
## Released:  01 Oct 2010
## Runtime:  120 min
## Genre:  Biography, Drama
## Director:  David Fincher
## Writer:  Aaron Sorkin (screenplay), Ben Mezrich (book)
## Actors:  Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
## Plot:  Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
## Language:  English, French
## Country:  USA
## Awards:  Won 3 Oscars. Another 165 wins & 169 nominations.
## Poster:  https://images-na.ssl-images-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg
## Ratings:  [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
## Metascore:  95
## imdbRating:  7.7
## imdbVotes:  514,092
## imdbID:  tt1285016
## Type:  movie
## DVD:  11 Jan 2011
## BoxOffice:  $96,400,000
## Production:  Columbia Pictures
## Website:  http://www.thesocialnetwork-movie.com/
## Response:  True
## <p><b>Pizza</b> is a yeasted flatbread typically topped with tomato sauce and cheese and baked in an oven. It is commonly topped with a selection of meats, vegetables and condiments.</p>
## <p>The term <i>pizza</i> was first recorded in the 10th century, in a Latin manuscript from Gaeta in Central Italy. Modern pizza was invented in Naples, Italy, and the dish and its variants have since become popular and common in many areas of the world. In 2009, upon Italy's request, Neapolitan pizza was safeguarded in the European Union as a Traditional Speciality Guaranteed dish. <i>Associazione Verace Pizza Napoletana</i> (True Neapolitan Pizza Association), a non-profit organization founded in 1984 with headquarters in Naples, aims to "promote and protect... the true Neapolitan pizza".</p>
## <p>Pizza is sold fresh or frozen, either whole or in portions, and is a common fast food item in Europe and North America. Various types of ovens are used to cook them and many varieties exist. Several similar dishes are prepared from ingredients commonly used in pizza preparation, such as calzone and stromboli.</p>

Chapter 3 - Diving deeper into the Twitter API

Twitter API and Authentication - 1) Twitter API, 2) filtering tweets, 3) API Authentication and OAuth, 4) Python package “tweepy”:

  • The Twitter API requires a Twitter account, then creating a new Twitter App, then copying over the Token and Token Secret
  • Twitter has many APIs, including the REST API (Representational State Transfer API), which allows for reading and writing Twitter data
  • The Twitter Streaming API includes “Public streams” for low-latency access to tweets
  • The Twitter Firehose API is not publicly available, requires special permission, and would likely be very expensive
  • Tweets are generally returned as JSON
  • The “tweepy” package has a nice balance between functionality and usability
    • auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    • auth.set_access_token(access_token, access_token_secret)
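
Since tweets generally arrive as JSON, a saved stream is typically a text file with one JSON object per line. The sketch below shows that decoding step on two made-up tweets held in memory (no Twitter access or credentials involved); the mentions helper and the sample texts are hypothetical.

```python
import io
import json
import re

# Two made-up tweets, one JSON object per line, with a blank line in between
raw = io.StringIO(
    '{"text": "Clinton and Trump debate tonight", "lang": "en"}\n'
    '\n'
    '{"text": "Sanders rally draws a crowd", "lang": "en"}\n'
)

# Decode each non-empty line into a dictionary
tweets = [json.loads(line) for line in raw if line.strip()]


def mentions(word, text):
    """Return True if word occurs in text, ignoring case."""
    return re.search(re.escape(word), text, flags=re.IGNORECASE) is not None


# Count the tweets that mention each keyword
counts = {
    name: sum(mentions(name, tweet["text"]) for tweet in tweets)
    for name in ["clinton", "trump", "sanders", "cruz"]
}
print(counts)
```

Skipping blank lines matters in practice because json.loads raises an error on an empty string.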

Example code includes:


# DO NOT RUN THIS - NO IDEA WHOSE KEYS THESE ARE (DataCamp???)
# Import package
import tweepy

# Store OAuth authentication credentials in relevant variables
access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"
consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"

# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)


# The class MyStreamListener is available at https://gist.github.com/hugobowne/18f1c0c0709ed1a52dc5bcd462ac69f4
# Initialize Stream listener
l = MyStreamListener()

# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)

# Filter Twitter Streams to capture data by the keywords:
stream.filter(track=['clinton', 'trump', 'sanders', 'cruz'])


# Import package
import json

# String of path to file: tweets_data_path
tweets_data_path = "tweets.txt"

# Initialize empty list to store tweets: tweets_data
tweets_data = []

# Open connection to file
tweets_file = open(tweets_data_path, "r")

# Read in tweets and store in list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())


# Import package
import pandas as pd

# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=["text", "lang"])

# Print head of DataFrame
print(df.head())


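# The re module is needed for re.search below
import re

def word_in_text(word, tweet):
    # Lower-case both inputs so that the search is case-insensitive
    word = word.lower()
    text = tweet.lower()
    # Search the lower-cased text (not the original tweet)
    match = re.search(word, text)

    if match:
        return True
    return False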


# Initialize list to store tweet counts
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]

# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])


# Import packages
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn style
sns.set(color_codes=True)

# Create a list of labels:cd
cd = ['clinton', 'trump', 'sanders', 'cruz']

# Plot bar chart of the mention counts (keyword arguments work across seaborn versions)
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()

Cleaning Data in Python

Chapter 1 - Exploring Your Data

Diagnose data for cleaning - column names, missing data, outliers, duplicate rows, un-tidy data, unexpected data values, etc.:

  • pandas DataFrames can be indexed/filtered using row/column names or row/column indices
  • Missing data are typically NaN in Python
  • For a pandas DataFrame df, df.head() and df.tail() will show the first/last 5 rows
    • df.columns returns an index of column names, which can reveal leading/trailing spaces
    • df.shape is analogous to dim() in R
    • df.info() will give a summary of the frame, as well as the associated columns (data types, non-missing values, and the like) - note that type “object” means it is non-numeric

Exploratory data analysis - suppose that a pandas DataFrame, df, has already been created:

  • To get frequency counts, use df.continent.value_counts(dropna=False) # in this case “continent” is the column name (can also subset using bracket notation, which is required if name has any “problems”). . .
    • Frequency counts will be in descending order
    • df[“continent”].value_counts(dropna=False) will return the same thing
  • To get summaries of numeric data, use df.describe() # will only be run for numeric columns
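
As a minimal sketch of the two calls above, the toy frame below stands in for df; its column names and values are made up for illustration and are not from the course data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only
df = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", np.nan, "Africa", "Asia"],
    "population": [10.0, 20.0, 30.0, 40.0, np.nan, 60.0],
})

# Frequency counts, keeping missing values as their own category
counts = df["continent"].value_counts(dropna=False)
print(counts)

# Summary statistics; by default only the numeric columns are summarized
summary = df.describe()
print(summary)
```

Passing dropna=False keeps the NaN row in the counts, which is usually what is wanted when the goal is to diagnose missing data.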

Visual exploratory data analysis - easy way to spot outliers and obvious errors - assume again that a pandas DataFrame, df, has already been created:

  • Bar plots for discrete data
  • Histograms for continuous data - df[“myColumn”].plot(kind=“hist”) will create the histogram, and plt.show() will then show the histogram
  • Data can be subset similar to R - df[df[“myVar”] condition] will pull only the rows where the specified condition is met
  • Box plots can be handy ways to summarize the numerical data - df.boxplot(column=“myColumn”, by=“myByVariable”)
  • Scatter plots can be handy ways to look at relationships between two numeric columns
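
The row-subsetting idiom above can be sketched as follows. The frame is a hypothetical stand-in that borrows two column names from the building-permit data; the values are invented:

```python
import pandas as pd

# Hypothetical stand-in for the building-permit data; values are invented
df = pd.DataFrame({
    "Borough": ["BROOKLYN", "QUEENS", "MANHATTAN", "BRONX"],
    "initial_cost": [0.0, 31000.0, 60000.0, 2800.0],
})

# A comparison on a column yields a boolean Series (one True/False per row)
mask = df["initial_cost"] > 10000

# Indexing with the mask keeps only the rows where it is True
high_cost = df[mask]
print(high_cost)

# Conditions combine with & (and) and | (or); each needs its own parentheses
subset = df[(df["initial_cost"] > 0) & (df["Borough"] != "MANHATTAN")]
print(subset)
```

Filtering like this is a quick way to inspect suspected outliers once a plot has flagged them.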

Example code includes:


# Downloaded small portion to myPath from https://data.cityofnewyork.us/Housing-Development/DOB-Job-Application-Filings/ic3t-wcy2/data

# tempData = pd.read_csv(myPath + "DOB_JOB_Application_Filings.csv")

# keyCols = ["Borough", "State", "Site Fill", "Existing Zoning Sqft", "Initial Cost", "Total Est. Fee"]
# useData = tempData[keyCols]
# useData.loc[:, "initial_cost"] = [float(d[1:]) for d in useData["Initial Cost"]]
# useData.loc[:, "total_est_fee"] = [float(d[1:]) for d in useData["Total Est. Fee"]]
# useData.to_csv(myPath + "dob_job_application_filings_subset.csv")

# MAY NEED TO GET DATA FROM https://opendata.cityofnewyork.us/
# Import pandas
import pandas as pd

myPath = "./PythonInputFiles/"


# Read the file into a DataFrame: df
df = pd.read_csv(myPath + 'dob_job_application_filings_subset.csv')

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

# Print the shape of df
print(df.shape)

# Print the columns of df
print(df.columns)

# Print the info of df
print(df.info())


# Print the value counts for 'Borough'
print(df['Borough'].value_counts(dropna=False))

# Print the value_counts for 'State'
print(df['State'].value_counts(dropna=False))

# Print the value counts for 'Site Fill'
print(df['Site Fill'].value_counts(dropna=False))


# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Plot the histogram
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)

# Display the histogram
# plt.show()
plt.savefig("_dummyPy045.png", bbox_inches="tight")
plt.clf()

# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt

# Create the boxplot
df.boxplot(column="initial_cost", by="Borough", rot=90)

# Display the plot
# plt.show()
plt.savefig("_dummyPy046.png", bbox_inches="tight")
plt.clf()

# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt

# Create and display the first scatter plot
df.plot(kind="scatter", x="initial_cost", y="total_est_fee", rot=70)
# plt.show()
plt.savefig("_dummyPy047.png", bbox_inches="tight")
plt.clf()
##    Unnamed: 0    Borough State            Site Fill  Existing Zoning Sqft  \
## 0           0   BROOKLYN    NY  USE UNDER 300 CU.YD                     0   
## 1           1   BROOKLYN    NY                  NaN                     0   
## 2           2  MANHATTAN    NY       NOT APPLICABLE                     0   
## 3           3     QUEENS    NY       NOT APPLICABLE                     0   
## 4           4   BROOKLYN    NY       NOT APPLICABLE                     0   
## 
##   Initial Cost Total Est. Fee  initial_cost  total_est_fee  
## 0        $0.00        $420.00           0.0          420.0  
## 1        $0.00        $170.00           0.0          170.0  
## 2    $60000.00        $831.50       60000.0          831.5  
## 3    $31000.00        $692.80       31000.0          692.8  
## 4     $3000.00        $225.00        3000.0          225.0  
##      Unnamed: 0   Borough State            Site Fill  Existing Zoning Sqft  \
## 138         138    QUEENS    NY                  NaN                     0   
## 139         139    QUEENS    NY       NOT APPLICABLE                     0   
## 140         140  BROOKLYN    NY       NOT APPLICABLE                     0   
## 141         141  BROOKLYN    NY  USE UNDER 300 CU.YD                     0   
## 142         142     BRONX    NY                  NaN                     0   
## 
##     Initial Cost Total Est. Fee  initial_cost  total_est_fee  
## 138    $63000.00        $832.40       63000.0          832.4  
## 139    $21000.00        $212.40       21000.0          212.4  
## 140     $2800.00        $395.00        2800.0          395.0  
## 141        $0.00        $472.00           0.0          472.0  
## 142        $0.00        $170.00           0.0          170.0  
## (143, 9)
## Index(['Unnamed: 0', 'Borough', 'State', 'Site Fill', 'Existing Zoning Sqft',
##        'Initial Cost', 'Total Est. Fee', 'initial_cost', 'total_est_fee'],
##       dtype='object')
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 143 entries, 0 to 142
## Data columns (total 9 columns):
## Unnamed: 0              143 non-null int64
## Borough                 143 non-null object
## State                   143 non-null object
## Site Fill               120 non-null object
## Existing Zoning Sqft    143 non-null int64
## Initial Cost            143 non-null object
## Total Est. Fee          143 non-null object
## initial_cost            143 non-null float64
## total_est_fee           143 non-null float64
## dtypes: float64(2), int64(2), object(5)
## memory usage: 7.3+ KB
## None
## MANHATTAN        66
## BROOKLYN         44
## QUEENS           16
## STATEN ISLAND    10
## BRONX             7
## Name: Borough, dtype: int64
## NY    136
## NJ      6
## NC      1
## Name: State, dtype: int64
## NOT APPLICABLE         108
## NaN                     23
## USE UNDER 300 CU.YD      8
## ON-SITE                  4
## Name: Site Fill, dtype: int64

NYC Open Data Sub-sample (Building Permits - Existing Zoning Sq Ft):

NYC Open Data Sub-sample (Building Permits - Initial Cost by Borough):

NYC Open Data Sub-sample (Building Permits):


Chapter 2 - Tidying data for analysis

Tidy data per the Hadley Wickham paper - “standard way to organize data within a dataset”:

  • Columns should represent separate variables - if the values are in the column names, then the data need to be melted
    • pd.melt(frame=myFrame, id_vars=[“myID”], value_vars=[“myValues”])
    • id_vars will be held fixed; these are the columns that will not be changed during the melting process
    • If the value_vars columns are not specified, Python will assume you want to melt all columns other than the ID variables
    • The default outputs new columns “variable” and “value”, though these can be over-ridden using var_name=“myVarName” and value_name=“myValName” inside melt
  • Rows should represent individual observations
  • Observational units should form tables
  • There are some trade-offs in reporting vs. data analysis, and tidying the data is primarily for making it easier to analyze

Pivoting data is the opposite of melting; turn unique values in to separate columns (assuming again that the DataFrame, df, already exists):

  • To pivot the data, use df.pivot(index=“myIndex”, columns=“myColumns”, values=“myValues”)
    • index is the columns to be fixed
    • columns is what is to be pivoted in the new columns
    • values is what is to be placed in to the new columns
  • If any index/column combination occurs more than once, .pivot() will raise an error; .pivot_table() is required instead, with an aggregation function telling pandas how the duplicated values should be combined
    • df.pivot_table(index=“myIndex”, columns=“myColumns”, values=“myValues”, aggfunc=myFunc) # same as .pivot, but specifying something like np.mean for how to handle duplicates
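
A small sketch of the difference between the two methods, using an invented long-format frame in which one (day, measurement) pair appears twice:

```python
import pandas as pd

# Hypothetical long-format data; (day=2, measurement="Wind") appears twice
long_df = pd.DataFrame({
    "day": [1, 1, 2, 2, 2],
    "measurement": ["Ozone", "Wind", "Ozone", "Wind", "Wind"],
    "reading": [41.0, 7.4, 36.0, 8.0, 10.0],
})

# .pivot() cannot decide which duplicate to keep, so it raises ValueError
try:
    long_df.pivot(index="day", columns="measurement", values="reading")
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)

# .pivot_table() aggregates the duplicates (here by taking their mean)
wide = long_df.pivot_table(index="day", columns="measurement",
                           values="reading", aggfunc="mean")
print(wide)
```

Here the two Wind readings for day 2 are averaged into a single cell; a different aggfunc (such as "max" or "sum") would combine them differently.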

Beyond melt and pivot - example from the Wickham data of having a single variable that combines sex and age-group (TB data) - common shape for reporting, but less than ideal for analysis:

  • First, melt the data so that all these columns become a single “variable” that contains the associated “value”
  • Second, create new variables for sex and age from the current variable named “variable”
    • tb_melt[“sex”] = tb_melt[“variable”].str[0] # will extract the first character, which is the sex in this case
    • tb_melt[“age”] = tb_melt[“variable”].str[1:] # will extract all but the first character

Example code includes:


# This seems to be the standard R datasets airquality file, read in as a pandas DataFrame
# Saved airquality.csv to the ./PythonInputFiles

myPath = "./PythonInputFiles/"



import pandas as pd
import numpy as np

airquality = pd.read_csv(myPath + "airquality.csv")


# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=["Month", "Day"])

# Print the head of airquality_melt
print(airquality_melt.head())


# Print the head of airquality
print(airquality.head())

# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=["Month", "Day"], var_name="measurement", value_name="reading")

# Print the head of airquality_melt
print(airquality_melt.head())


# Print the head of airquality_melt
print(airquality_melt.head())

# airquality_melt.pivot() would bomb out on this; not sure why (may be due to having 2+ variables in the index)
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=["Month", "Day"], columns="measurement", values="reading")

# Print the head of airquality_pivot
print(airquality_pivot.head())


# Print the index of airquality_pivot
print(airquality_pivot.index)

# Reset the index of airquality_pivot: airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the new index of airquality_pivot
print(airquality_pivot.index)

# Print the head of airquality_pivot
print(airquality_pivot.head())


# Pivot airquality_dup: airquality_pivot
# keyRows = [x for x in range(len(airquality.index))] + [2, 4, 6, 8, 10]
# airquality_dup = airquality.iloc[keyRows, :]
airquality_pivot = airquality_melt.pivot_table(index=["Month", "Day"], columns="measurement", values="reading", aggfunc=np.mean)

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())

# Print the head of airquality
print(airquality.head())


# tb is 201x18 with variables ['country', 'year', 'm014', 'm1524', 'm2534', 'm3544', 'm4554', 'm5564', 'm65', 'mu', 'f014', 'f1524', 'f2534', 'f3544', 'f4554', 'f5564', 'f65', 'fu']
# year is set to be always 2000 with fu and mu always NaN
# Create dummy data for tb (just use 3 countries and the 014 and 1524 columns)
tb = pd.DataFrame( { "country":["USA", "CAN", "MEX"] , "year":2000 , "fu":np.nan , "mu":np.nan , "f014":[2, 3, 4] , "m014":[5, 6, 7] , "f1524": [8, 9, 0] , "m1524":[1, 2, 3] } )


# Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=["country", "year"])

# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

# Print the head of tb_melt
print(tb_melt.head())  # 18 x 6 for this dummy data (3,216 x 6 with the full tb data): ['country', 'year', 'variable', 'value', 'gender', 'age_group']


# Ebola dataset is available at https://data.humdata.org/dataset/ebola-cases-2014
# Variables are split by an underscore 'Date', 'Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone', 'Cases_Nigeria', 'Cases_Senegal', 'Cases_UnitedStates', 'Cases_Spain', 'Cases_Mali', 'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone', 'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates', 'Deaths_Spain', 'Deaths_Mali'

# Downloaded file, then manipulated to be like the above as follows:
# ebola_test = pd.read_csv(myPath + "ebola_data_db_format.csv")
# ebola_test["UseCountry"] = ebola_test["Country"].str.replace(" ", "")
# ebola_test["UseCountry"] = ebola_test["UseCountry"].str.replace("2", "")
# keyIndic = ["Cumulative number of confirmed Ebola deaths", "Cumulative number of confirmed Ebola cases"]
# keyBool = [x in keyIndic for x in ebola_test["Indicator"]]
# ebola_test = ebola_test.loc[keyBool, :]
# indicMap = {keyIndic[0]:"Deaths", keyIndic[1]:"Cases"}
# ebola_test["UseIndicator"] = ebola_test["Indicator"].map(indicMap)
# ebolaPre = ebola_test[["Date", "UseCountry", "UseIndicator", "value"]]
# ebolaPre["CI"] = ebolaPre["UseIndicator"] + "_" + ebolaPre["UseCountry"]
# ebolaSave = ebolaPre.pivot_table(index="Date", columns="CI", values="value", aggfunc="max").fillna(method="ffill").fillna(0)
# ebolaSave.to_csv(myPath + "ebola.csv")

ebola = pd.read_csv(myPath + "ebola.csv", parse_dates=["Date"])


# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=["Date"], var_name="type_country", value_name="counts")

# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt["type_country"].str.split("_")

# Create the 'type' column
ebola_melt['type'] = ebola_melt['str_split'].str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt['str_split'].str.get(1)

# Print the head of ebola_melt
print(ebola_melt.head())


# ebola_melt.to_csv(myPath + "ebola_melt.csv", index=False)
# Run outside of this shell so that the file is accessible later
##    Ozone  Solar.R  Wind  Temp  Month  Day
## 0   41.0    190.0   7.4    67      5    1
## 1   36.0    118.0   8.0    72      5    2
## 2   12.0    149.0  12.6    74      5    3
## 3   18.0    313.0  11.5    62      5    4
## 4    NaN      NaN  14.3    56      5    5
##    Month  Day variable  value
## 0      5    1    Ozone   41.0
## 1      5    2    Ozone   36.0
## 2      5    3    Ozone   12.0
## 3      5    4    Ozone   18.0
## 4      5    5    Ozone    NaN
##    Ozone  Solar.R  Wind  Temp  Month  Day
## 0   41.0    190.0   7.4    67      5    1
## 1   36.0    118.0   8.0    72      5    2
## 2   12.0    149.0  12.6    74      5    3
## 3   18.0    313.0  11.5    62      5    4
## 4    NaN      NaN  14.3    56      5    5
##    Month  Day measurement  reading
## 0      5    1       Ozone     41.0
## 1      5    2       Ozone     36.0
## 2      5    3       Ozone     12.0
## 3      5    4       Ozone     18.0
## 4      5    5       Ozone      NaN
##    Month  Day measurement  reading
## 0      5    1       Ozone     41.0
## 1      5    2       Ozone     36.0
## 2      5    3       Ozone     12.0
## 3      5    4       Ozone     18.0
## 4      5    5       Ozone      NaN
## measurement  Ozone  Solar.R  Temp  Wind
## Month Day                              
## 5     1       41.0    190.0  67.0   7.4
##       2       36.0    118.0  72.0   8.0
##       3       12.0    149.0  74.0  12.6
##       4       18.0    313.0  62.0  11.5
##       5        NaN      NaN  56.0  14.3
## MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
##            labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]],
##            names=['Month', 'Day'])
## RangeIndex(start=0, stop=153, step=1)
## measurement  Month  Day  Ozone  Solar.R  Temp  Wind
## 0                5    1   41.0    190.0  67.0   7.4
## 1                5    2   36.0    118.0  72.0   8.0
## 2                5    3   12.0    149.0  74.0  12.6
## 3                5    4   18.0    313.0  62.0  11.5
## 4                5    5    NaN      NaN  56.0  14.3
## measurement  Month  Day  Ozone  Solar.R  Temp  Wind
## 0                5    1   41.0    190.0  67.0   7.4
## 1                5    2   36.0    118.0  72.0   8.0
## 2                5    3   12.0    149.0  74.0  12.6
## 3                5    4   18.0    313.0  62.0  11.5
## 4                5    5    NaN      NaN  56.0  14.3
##    Ozone  Solar.R  Wind  Temp  Month  Day
## 0   41.0    190.0   7.4    67      5    1
## 1   36.0    118.0   8.0    72      5    2
## 2   12.0    149.0  12.6    74      5    3
## 3   18.0    313.0  11.5    62      5    4
## 4    NaN      NaN  14.3    56      5    5
##   country  year variable  value gender age_group
## 0     USA  2000     f014    2.0      f       014
## 1     CAN  2000     f014    3.0      f       014
## 2     MEX  2000     f014    4.0      f       014
## 3     USA  2000    f1524    8.0      f      1524
## 4     CAN  2000    f1524    9.0      f      1524
##         Date  type_country  counts        str_split   type country
## 0 2014-08-29  Cases_Guinea   482.0  [Cases, Guinea]  Cases  Guinea
## 1 2014-09-05  Cases_Guinea   604.0  [Cases, Guinea]  Cases  Guinea
## 2 2014-09-08  Cases_Guinea   664.0  [Cases, Guinea]  Cases  Guinea
## 3 2014-09-12  Cases_Guinea   678.0  [Cases, Guinea]  Cases  Guinea
## 4 2014-09-16  Cases_Guinea   743.0  [Cases, Guinea]  Cases  Guinea

Chapter 3 - Combining data for analysis

Concatenating data - data may be in separate files (too many records, time series data by day, etc.), while you want to combine it:

  • Concatenating data (similar to rbind in R) in Python leaves the original row indices untouched, which can induce duplicate indices
  • The pd.concat([myFileList]) will place the frames together in a single frame # requires that import pandas as pd was called previously
    • Using ignore_index=True inside pd.concat() will re-cast the row indices from 0 to n-1
    • To instead concatenate columns (similar to cbind in R), declare the axis=1 option inside pd.concat()
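
The index behavior described above can be seen with two tiny, made-up frames:

```python
import pandas as pd

# Two hypothetical frames, each with the default 0..n-1 row index
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

# Row-wise concatenation keeps the original indices, so 0 and 1 repeat
stacked = pd.concat([a, b])
print(list(stacked.index))

# ignore_index=True renumbers the rows from 0 to n-1
reindexed = pd.concat([a, b], ignore_index=True)
print(list(reindexed.index))

# axis=1 places the frames side by side (aligned on the row index)
side = pd.concat([a, b.rename(columns={"x": "y"})], axis=1)
print(side)
```

Duplicate indices matter mainly when selecting with .loc afterwards, since a single label can then return more than one row.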

Finding and concatenating data - issue of many files needing to be concatenated:

  • The glob function from the glob library helps to find files based on a consistent search pattern
    • The wildcards * and ? are both available, with * matching any number of characters while ? matches a single character
    • Basic usage would be glob.glob(“mySearchString”)
  • The basic plan would be to 1) load all the files to pandas, and then 2) make a list of the DataFrame names for concatenation

Merge data - extension on concatenation (which is more piecing something back together that was originally one piece but became split):

  • Merging can manage joins of tables that never were one piece, combining disparate data based on common columns
  • The merge syntax is pd.merge(left=leftFrame, right=rightFrame, how=, on=, left_on=, right_on=) # default is an INNER JOIN and need to specify either on= (common variables) or left_on=/right_on=
    • The defaults for on, left_on, and right_on are all None
    • The default is for how=“inner” though options for “left”, “right”, and “outer” can also be declared
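
To illustrate how the how= option changes the result, here is a sketch with invented frames loosely modeled on the site/visited example; the "XX-9" row is made up so that each table has keys the other lacks:

```python
import pandas as pd

# Hypothetical frames; "XX-9" has no match in site, and DR-3/MSK-4 have no visits
site = pd.DataFrame({"name": ["DR-1", "DR-3", "MSK-4"],
                     "lat": [-50.0, -47.0, -48.9]})
visited = pd.DataFrame({"site": ["DR-1", "DR-1", "XX-9"],
                        "ident": [619, 622, 900]})

# Default inner join: only keys present in both frames survive
inner = pd.merge(left=site, right=visited, left_on="name", right_on="site")

# Left join: every row of site is kept; unmatched rows get NaN for visited columns
left = pd.merge(left=site, right=visited, how="left",
                left_on="name", right_on="site")

# Outer join: all keys from either frame are kept
outer = pd.merge(left=site, right=visited, how="outer",
                 left_on="name", right_on="site")

print(len(inner), len(left), len(outer))
```

Comparing the row counts of the different joins is a cheap check on how many keys fail to match.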

Example code includes:


myPath = "./PythonInputFiles/"

import pandas as pd
import numpy as np


# uber datasets are a small subset from within http://data.beta.nyc/dataset/uber-trip-data-foiled-apr-sep-2014
# downloaded file "Uber-Jan-Feb-FOIL.csv" to myPath


uber = pd.read_csv(myPath + "Uber-Jan-Feb-FOIL.csv")

cuts = [round(len(uber.index) / 3), round(2 * len(uber.index) / 3)]

uber1 = uber.iloc[:cuts[0], :]
uber2 = uber.iloc[cuts[0]:cuts[1], :]
uber3 = uber.iloc[cuts[1]:, :]

# Save outside of this routine
# uber1.to_csv(myPath + "uber1.csv", index=False)
# uber2.to_csv(myPath + "uber2.csv", index=False)
# uber3.to_csv(myPath + "uber3.csv", index=False)


# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

# Print the shape of row_concat
print(row_concat.shape)

# Print the head of row_concat
print(row_concat.head())

print(np.sum(row_concat != uber))


# ebola_melt is 1,952x4 of Date-Day-status_country-counts
# status_country is 1,952x2 of status-country (the previous status_country has been string split)
# Create this from the file in the previous exercise
ebola_melt = pd.read_csv(myPath + "ebola_melt.csv", parse_dates=["Date"])
ebola_melt.columns = ["Date", "status_country", "counts", "str_split", "status", "country"]

status_country = ebola_melt[["status", "country"]]
ebola_melt = ebola_melt[["Date", "status_country", "counts"]]

# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)

# Print the shape of ebola_tidy
print(ebola_tidy.shape)

# Print the head of ebola_tidy
print(ebola_tidy.head())


# Has files ['uber-raw-data-2014_06.csv', 'uber-raw-data-2014_04.csv', 'uber-raw-data-2014_05.csv'] available
# Date/Time-Lat-Lon-Base
# Import necessary modules
import glob
import pandas as pd

# Write the pattern: pattern
# This is designed to get the uber1.csv, uber2.csv, and uber3.csv files
pattern = myPath + 'uber?.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)

# Print the file names
print(csv_files)

# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])

# Print the head of csv2
print(csv2.head())


# Create an empty list: frames
frames = []

#  Iterate over csv_files
for csv in csv_files:
    
    #  Read csv into a DataFrame: df
    df = pd.read_csv(csv)
    
    # Append df to frames
    frames.append(df)

# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)

# Print the shape of uber
print(uber.shape)

# Print the head of uber
print(uber.head())


# site is a 3x3 with name-lat-long - name=["DR-1", "DR-3", "MSK-4"], lat=[-50, -47, -48.9], lon=[-129, -127, -123.4]
# visited is a 3x3 with ident-site-dated - ident=[619, 734, 837], site=["DR-1", "DR-3", "MSK-4"], dated=["1927-02", "1939-01", "1932-01"]

site = pd.DataFrame( { "name":["DR-1", "DR-3", "MSK-4"], "lat":[-50, -47, -48.9], "lon":[-129, -127, -123.4] } )
visited = pd.DataFrame( { "ident":[619, 734, 837], "site":["DR-1", "DR-3", "MSK-4"], "dated":["1927-02", "1939-01", "1932-01"] } )

# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on=["name"], right_on=["site"])

# Print o2o
print(o2o)


# now make visited 8x3 with ident=[619, 622, 734, 735, 751, 752, 837, 844], site=['DR-1', 'DR-1', 'DR-3', 'DR-3', 'DR-3', 'DR-3', 'MSK-4', 'DR-1'], dated=['1927-02-08', '1927-02-10', '1939-01-07', '1930-01-12', '1930-02-26', nan, '1932-01-14', '1932-03-22']

visited = pd.DataFrame( {"ident":[619, 622, 734, 735, 751, 752, 837, 844], "site":['DR-1', 'DR-1', 'DR-3', 'DR-3', 'DR-3', 'DR-3', 'MSK-4', 'DR-1'], "dated":['1927-02-08', '1927-02-10', '1939-01-07', '1930-01-12', '1930-02-26', np.nan, '1932-01-14', '1932-03-22']} )

# Merge the DataFrames: m2o
m2o = pd.merge(left=site, right=visited, left_on=["name"], right_on=["site"])

# Print m2o
print(m2o)


# add an additional frame surveyed which is 21x4 with taken-person-quant-reading (taken matched ident in file visited)
# Merge site and visited: m2m
# m2m = pd.merge(left=site, right=visited, left_on=["name"], right_on=["site"])

# Merge m2m and survey: m2m
# m2m = pd.merge(left=m2m, right=survey, left_on=["ident"], right_on=["taken"])

# Print the first 20 lines of m2m
# print(m2m.head(20))
## (354, 4)
##   dispatching_base_number      date  active_vehicles  trips
## 0                  B02512  1/1/2015              190   1132
## 1                  B02765  1/1/2015              225   1765
## 2                  B02764  1/1/2015             3427  29421
## 3                  B02682  1/1/2015              945   7679
## 4                  B02617  1/1/2015             1228   9537
## dispatching_base_number    0
## date                       0
## active_vehicles            0
## trips                      0
## dtype: int64
## (5180, 5)
##         Date status_country  counts status country
## 0 2014-08-29   Cases_Guinea   482.0  Cases  Guinea
## 1 2014-09-05   Cases_Guinea   604.0  Cases  Guinea
## 2 2014-09-08   Cases_Guinea   664.0  Cases  Guinea
## 3 2014-09-12   Cases_Guinea   678.0  Cases  Guinea
## 4 2014-09-16   Cases_Guinea   743.0  Cases  Guinea
## ['./PythonInputFiles\\uber1.csv', './PythonInputFiles\\uber2.csv', './PythonInputFiles\\uber3.csv']
##   dispatching_base_number       date  active_vehicles  trips
## 0                  B02765  1/20/2015              272   1608
## 1                  B02617  1/20/2015             1350  10015
## 2                  B02764  1/21/2015             3718  27344
## 3                  B02512  1/21/2015              242   1519
## 4                  B02682  1/21/2015             1228   9472
## (354, 4)
##   dispatching_base_number      date  active_vehicles  trips
## 0                  B02512  1/1/2015              190   1132
## 1                  B02765  1/1/2015              225   1765
## 2                  B02764  1/1/2015             3427  29421
## 3                  B02682  1/1/2015              945   7679
## 4                  B02617  1/1/2015             1228   9537
##     lat    lon   name    dated  ident   site
## 0 -50.0 -129.0   DR-1  1927-02    619   DR-1
## 1 -47.0 -127.0   DR-3  1939-01    734   DR-3
## 2 -48.9 -123.4  MSK-4  1932-01    837  MSK-4
##     lat    lon   name       dated  ident   site
## 0 -50.0 -129.0   DR-1  1927-02-08    619   DR-1
## 1 -50.0 -129.0   DR-1  1927-02-10    622   DR-1
## 2 -50.0 -129.0   DR-1  1932-03-22    844   DR-1
## 3 -47.0 -127.0   DR-3  1939-01-07    734   DR-3
## 4 -47.0 -127.0   DR-3  1930-01-12    735   DR-3
## 5 -47.0 -127.0   DR-3  1930-02-26    751   DR-3
## 6 -47.0 -127.0   DR-3         NaN    752   DR-3
## 7 -48.9 -123.4  MSK-4  1932-01-14    837  MSK-4

Chapter 4 - Cleaning data for analysis

Data types and conversions - can see the data types using the df.dtypes attribute of a pandas DataFrame df:

  • Often helpful to convert strings to numerics or vice versa
  • The .astype() method will allow for type conversions
    • df[“a”] = df[“a”].astype(str) will create a string variable
    • df[“a”] = df[“a”].astype(“category”) will create a categorical (factor) variable
    • df[“a”] = pd.to_numeric(df[“a”], errors=“coerce”) will create a numeric variable, with NaN written where the string is not a sensible numeric
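
A brief sketch of these conversions on an invented frame, where one entry of column “a” is deliberately not a valid number:

```python
import pandas as pd

# Hypothetical frame; "missing" cannot be parsed as a number
df = pd.DataFrame({"a": ["1.5", "2", "missing"],
                   "sex": ["F", "M", "F"]})

# Convert a string column to a categorical (factor-like) column
df["sex"] = df["sex"].astype("category")

# Convert strings to numbers; unparseable entries become NaN instead of raising
df["a_num"] = pd.to_numeric(df["a"], errors="coerce")

print(df.dtypes)
print(df)
```

Without errors="coerce", pd.to_numeric would raise on the bad entry, which can be the preferable behavior when invalid values are not expected.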

Using regular expressions to clean strings - the most common form of data cleaning is string manipulation:

  • As an example, monetary values can be represented in many ways
  • The “re” library is used for pattern matching (using regular expressions) within strings
    • The asterisk (*) means “0 or more times”
    • The plus sign (+) means “1 or more times”
    • The \d represents any digit, broadly the same as [0-9], so \d* means zero or more consecutive digits
    • The \$ means the actual “$” symbol, with the back-slash escaping the symbol from its default meaning as “end-of-string”; so \$\d* will match the dollar sign followed by 0+ digits
    • The \. means the actual “.” symbol, with the back-slash escaping the default meaning of the period; so \$\d*\.\d* will match the dollar sign followed by 0+ digits followed by the period followed by 0+ digits
    • The {2} means to have exactly two of the items; so \$\d*\.\d{2} will match “$[0+ digits].[2 digits]”
    • The caret (^) means “at the start” and the dollar ($) means “at the end”, so ^\$\d*\.\d{2}$ will not match anything with 3+ digits after the period, nor anything with text before the $
  • Best practices for using an re are to 1) compile the pattern first, then 2) apply the compiled pattern to the string
    • pattern = re.compile(“myRegEx”) will compile the specified regular expression for use elsewhere
    • result = pattern.match(“myText”) will then pull out the relevant matches to the compiled pattern
    • bool(result) will return a True/False as to whether we made any matches
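
As a sketch of the compile-then-match workflow, the monetary pattern described above (escaped dollar sign, optional digits, escaped period, exactly two digits, anchored at both ends) is applied to a few made-up strings:

```python
import re

# Raw string so the back-slashes reach the regex engine unchanged
pattern = re.compile(r"^\$\d*\.\d{2}$")

# Hypothetical test strings
samples = ["$17.89", "$17.895", "USD $17.89", "$.50"]

# bool() turns the match object (or None) into True/False
checks = {s: bool(pattern.match(s)) for s in samples}
print(checks)
```

Note that "$.50" matches because \d* allows zero digits before the period; using \d+ instead would require at least one.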

Using functions to clean data - in particular, the .apply() function:

  • df.apply(myFunction, axis=) # axis=0 applies the function to each column, axis=1 applies it to each row
  • Example using a few columns with dollar data - check that valid numbers, remove the dollar sign, cast as numeric (NaN if invalid data), store as new column
    • from numpy import NaN
    • myVar.replace(“$”, “”) will replace the “$” with “” (more or less, remove the leading dollar signs)
  • Frequently, a function will be passed an entire row of data, so the cleaning can be done for all required variables in the same function
    • df[“myNewVar”] = df.apply(myFunc, axis=1, pattern=pattern) # will pass the rows as argument 1 and pattern as argument 2 to myFunc, once for each row
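
The row-wise pattern above might look like the sketch below. The function name diff_money, the column names, and the data are all assumptions made for illustration:

```python
import re

import numpy as np
import pandas as pd

# Hypothetical dollar-formatted columns; "bad" is deliberately invalid
df = pd.DataFrame({"total_dollar": ["$16.99", "$10.34", "bad"],
                   "tip_dollar": ["$1.01", "$1.66", "$3.50"]})

pattern = re.compile(r"^\$\d*\.\d{2}$")

def diff_money(row, pattern):
    """Return total minus tip as a float, or NaN if either value is malformed."""
    total, tip = row["total_dollar"], row["tip_dollar"]
    if pattern.match(total) and pattern.match(tip):
        # Strip the dollar sign before casting to float
        return float(total.replace("$", "")) - float(tip.replace("$", ""))
    return np.nan

# axis=1 passes each row to diff_money; pattern is forwarded as a keyword argument
df["diff"] = df.apply(diff_money, axis=1, pattern=pattern)
print(df)
```

Validating against the pattern before casting means malformed entries become NaN rather than stopping the whole run with an exception.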

Duplicate and missing data - can skew results in undesirable manners:

  • The df.drop_duplicates() method will remove any rows that are exact duplicates of each other
  • The df.info() method is a nice way to see how much missing data there is by variable
  • The df.dropna() method will drop any rows that have any NaN included in them (keeps only the complete.cases() in R syntax)
  • The df.fillna() method allows for replacing NaN with either a user-provided value or a calculated value (such as mean/median for the variable where it exists)
    • Can run through multiple columns at the same time by passing a list; df[myList] = df[myList].fillna(0) will turn NaN into zero in every column specified in myList
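
The methods above can be tried on a small made-up frame. This is a sketch with hypothetical column names, not data from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one exact duplicate row and some missing values
df = pd.DataFrame({"city": ["A", "A", "B", "C"],
                   "temp": [20.0, 20.0, np.nan, 26.0],
                   "rain": [1.0, 1.0, 0.5, np.nan]})

deduped = df.drop_duplicates()    # drops the second, identical "A" row
complete = deduped.dropna()       # keeps only rows with no NaN at all

# Fill one column with its own mean, computed from the non-missing values
filled = deduped.copy()
filled["temp"] = filled["temp"].fillna(filled["temp"].mean())

# Fill several columns at once with a constant
cols = ["temp", "rain"]
zeroed = deduped.copy()
zeroed[cols] = zeroed[cols].fillna(0)

print(len(df), len(deduped), len(complete))
```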

Testing with asserts - early detection for problems that may plague the analysis later:

  • More or less, assert myExpression does nothing if the expression is True and raises an AssertionError if it is False
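
A short sketch of both outcomes, using a hypothetical frame of counts. The try/except wrapper is only there so that the failing case can be shown without stopping the script.

```python
import pandas as pd

# Hypothetical frame of non-negative counts
df = pd.DataFrame({"cases": [0, 3, 7], "deaths": [0, 1, 2]})

# These pass silently: every value is present and non-negative
assert df.notnull().all().all()
assert (df >= 0).all().all()

# A failing assert raises AssertionError, optionally carrying a message
try:
    assert (df["cases"] > 0).all(), "cases contains a zero"
    failed = False
except AssertionError as err:
    failed = True
    print("Assertion failed:", err)
```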

Example code includes:


# The tips data is available at https://github.com/mwaskom/seaborn-data/blob/master/tips.csv

myPath = "./PythonInputFiles/"

import pandas as pd
import numpy as np

tips = pd.read_csv(myPath + "tips.csv")


# Convert the sex column to type 'category'
tips.sex = tips["sex"].astype("category")

# Convert the smoker column to type 'category'
tips.smoker = tips["smoker"].astype("category")

# Print the info of tips
print(tips.info())


# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips["total_bill"], errors="coerce")

# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips["tip"], errors="coerce")

# Print the info of tips
print(tips.info())


# Import the regular expression module
import re

# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))

# See if the pattern matches
result = prog.match("1123-456-7890")
print(bool(result))


# Import the regular expression module
import re

# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)


# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)

# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)

# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)


import numpy as np

# Define recode_sex()
def recode_sex(sex_value):
    
    # Return 1 if sex_value is 'Male'
    if sex_value == "Male":
        return 1
    
    # Return 0 if sex_value is 'Female'    
    elif sex_value == "Female":
        return 0
    
    # Return np.nan    
    else:
        return np.nan

# Apply the function to the sex column
tips['sex_recode'] = tips["sex"].apply(recode_sex)


# Create the total_dollar field
tips["total_dollar"] = "$" + tips["total_bill"].astype(str)

# Write the lambda function using replace
tips['total_dollar_replace'] = tips["total_dollar"].apply(lambda x: x.replace('$', ''))

# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips["total_dollar"].apply(lambda x: re.findall(r'\d+\.\d+', x))

# Print the head of tips
print(tips.head())


# DO NOT HAVE DATASET "tracks"
# Create the new DataFrame: tracks
# tracks = billboard[['year', 'artist', 'track', 'time']]

# Print info of tracks
# print(tracks.info())

# Drop the duplicates: tracks_no_duplicates
# tracks_no_duplicates = tracks.drop_duplicates()

# Print info of tracks
# print(tracks_no_duplicates.info())


# SEEMS TO BE "airquality" as per the R datasets package
# Previously saved as myPath + "airquality.csv"
airquality = pd.read_csv(myPath + "airquality.csv")


# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality["Ozone"].mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality["Ozone"].fillna(oz_mean)

# Print the info of airquality
print(airquality.info())


# DO NOT HAVE FRAME ebola - 122 x 18 of Date-Day-Cases_[8 countries]-Deaths_[8 countries]
# Use the version saved previously
ebola = pd.read_csv(myPath + "ebola.csv", parse_dates=["Date"])

# Assert that there are no missing values
assert ebola.notnull().all().all()

# Assert that all values are >= 0
assert (ebola >= 0).all().all()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 244 entries, 0 to 243
## Data columns (total 7 columns):
## total_bill    244 non-null float64
## tip           244 non-null float64
## sex           244 non-null category
## smoker        244 non-null category
## day           244 non-null object
## time          244 non-null object
## size          244 non-null int64
## dtypes: category(2), float64(2), int64(1), object(2)
## memory usage: 8.2+ KB
## None
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 244 entries, 0 to 243
## Data columns (total 7 columns):
## total_bill    244 non-null float64
## tip           244 non-null float64
## sex           244 non-null category
## smoker        244 non-null category
## day           244 non-null object
## time          244 non-null object
## size          244 non-null int64
## dtypes: category(2), float64(2), int64(1), object(2)
## memory usage: 8.2+ KB
## None
## True
## False
## ['10', '1']
## True
## True
## True
##    total_bill   tip     sex smoker  day    time  size sex_recode total_dollar  \
## 0       16.99  1.01  Female     No  Sun  Dinner     2          0       $16.99   
## 1       10.34  1.66    Male     No  Sun  Dinner     3          1       $10.34   
## 2       21.01  3.50    Male     No  Sun  Dinner     3          1       $21.01   
## 3       23.68  3.31    Male     No  Sun  Dinner     2          1       $23.68   
## 4       24.59  3.61  Female     No  Sun  Dinner     4          0       $24.59   
## 
##   total_dollar_replace total_dollar_re  
## 0                16.99         [16.99]  
## 1                10.34         [10.34]  
## 2                21.01         [21.01]  
## 3                23.68         [23.68]  
## 4                24.59         [24.59]  
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 153 entries, 0 to 152
## Data columns (total 6 columns):
## Ozone      153 non-null float64
## Solar.R    146 non-null float64
## Wind       153 non-null float64
## Temp       153 non-null int64
## Month      153 non-null int64
## Day        153 non-null int64
## dtypes: float64(3), int64(3)
## memory usage: 7.2 KB
## None

Chapter 5 - Case Study

Putting it all together - Gapminder data (NPO supporting global sustainable development):

  • Dataset will be life expectancy by country and year
  • Goal is to clean and combine all of the data so there is a single file ready for further data analysis

Initial impressions of the data - depending on the analysis needs, can melt (columns to rows) or pivot (new columns from column data) the data:

  • Can check the column types by using df.dtypes
  • Can change column types using pd.to_numeric() or .astype()
  • Can save a CSV using df.to_csv(myFile)
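
The melt, type-conversion, and save steps can be sketched on a tiny made-up table. The country names and values here are hypothetical, and the CSV is written to an in-memory buffer so the example does not touch the file system.

```python
import io
import pandas as pd

# Hypothetical wide table: one row per country, one column per year (values read in as text)
wide = pd.DataFrame({"country": ["X", "Y"],
                     "1800": ["30.1", "28.4"],
                     "1801": ["30.3", "n/a"]})
print(wide.dtypes)

# Melt: the year columns become rows
melted = pd.melt(wide, id_vars="country", var_name="year", value_name="life_expectancy")

# Convert types; errors="coerce" turns unparseable text into NaN
melted["year"] = melted["year"].astype(int)
melted["life_expectancy"] = pd.to_numeric(melted["life_expectancy"], errors="coerce")
print(melted.dtypes)

# Pivot back: one column per year again
rewide = melted.pivot(index="country", columns="year", values="life_expectancy")

# Write CSV text to an in-memory buffer (a file path works the same way)
buffer = io.StringIO()
melted.to_csv(buffer, index=False)
```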

Example code includes:


myPath = "./PythonInputFiles/"

# The DataFrame g1800s is a life expectancy table of 260 x 101 - "Life Expectancy" (country) followed by "1800" through "1899"
# Copied data from https://docs.google.com/spreadsheets/d/1H3nzTwbn8z4lJ5gJ_WfDgCeGEXK3PVGcNjQ_U5og8eo/pub as accessed from http://www.gapminder.org/data/ to myPath + "gapminder_lifeExp_1800_1916.xlsx"

import pandas as pd
gapExcel = pd.read_excel(myPath + "gapminder_lifeExp_1800_1916.xlsx")


# Convert column labels to text
gapExcel.columns = gapExcel.columns.astype(str)
assert gapExcel.columns[0] == "Life expectancy"

# Create booleans for 1800s, 1900s, and 2000s, including "Life expectancy" (country columns) as true in all
col1800s = gapExcel.columns.str.startswith("18")
col1900s = gapExcel.columns.str.startswith("19")
col2000s = gapExcel.columns.str.startswith("20")
col1800s[0] = True
col1900s[0] = True
col2000s[0] = True

# Create g1800s, g1900s, g2000s
g1800s = gapExcel.loc[:, col1800s]
g1900s = gapExcel.loc[:, col1900s]
g2000s = gapExcel.loc[:, col2000s]


# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Create the scatter plot
g1800s.plot(kind="scatter", x="1800", y="1899")

# Specify axis labels
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')

# Specify axis limits
plt.xlim(20, 55)
plt.ylim(20, 55)

# Display the plot
# plt.show()
plt.savefig("_dummyPy048.png", bbox_inches="tight")
plt.clf()


import pandas as pd
import numpy as np


def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()[1:-1]
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0

# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == "Life expectancy"

# Check whether the values in the row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()

# Check that there is only one instance of each country
assert g1800s['Life expectancy'].value_counts().iloc[0] == 1


# Also frames g1900s as 260x101 and g2000s as 260x18
# Concatenate the DataFrames row-wise
gapminder = pd.concat([g1800s, g1900s, g2000s])

# Print the shape of gapminder
print(gapminder.shape)

# Print the head of gapminder
print(gapminder.head())


# Melt gapminder: gapminder_melt
gapminder_melt = pd.melt(gapminder, id_vars="Life expectancy")

# Rename the columns
gapminder_melt.columns = ['country', 'year', 'life_expectancy']

# Print the head of gapminder_melt
print(gapminder_melt.head())


# Exercises used gapminder_melt as gapminder - keep copy before over-writing in case needed later
gapminder_old = gapminder.copy()
gapminder = gapminder_melt.copy()


# Convert the year column to numeric
gapminder.year = pd.to_numeric(gapminder.year)

# Test if country is of type object
assert gapminder.country.dtypes == object

# Test if year is of type int64
assert gapminder.year.dtypes == np.int64

# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64


# Create the series of countries: countries
countries = gapminder["country"]

# Drop all the duplicates from countries
countries = countries.drop_duplicates()

# Write the regular expression: pattern
pattern = r'^[A-Za-z\.\s]*$'

# Create the Boolean vector: mask
mask = countries.str.contains(pattern)

# Invert the mask: mask_inverse
mask_inverse = ~mask  # The ~ is for inversion

# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]

# Print invalid_countries
print(invalid_countries)


# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()

# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()

# Print the shape of gapminder (prior to dropping NaN)
print(gapminder.shape)

# Drop the missing values
gapminder = gapminder.dropna()

# Print the shape of gapminder (after dropping NaN)
print(gapminder.shape)


# Add first subplot
plt.subplot(2, 1, 1) 

# Create a histogram of life_expectancy
gapminder["life_expectancy"].plot(kind="hist")

# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby(by="year")["life_expectancy"].mean()

# Print the head of gapminder_agg
print(gapminder_agg.head())

# Print the tail of gapminder_agg
print(gapminder_agg.tail())


# Add second subplot
plt.subplot(2, 1, 2)

# Create a line plot of life expectancy per year
gapminder_agg.plot()

# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')

# Display the plots
plt.tight_layout()
# plt.show()
plt.savefig("_dummyPy049.png", bbox_inches="tight")
plt.clf()

# Save both DataFrames to csv files
gapminder.to_csv(myPath + "gapminder.csv")
gapminder_agg.to_csv(myPath + "gapminder_agg.csv")
## (780, 218)
##     1800   1801   1802   1803   1804   1805   1806   1807   1808   1809  \
## 0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
## 1  28.21  28.20  28.19  28.18  28.17  28.16  28.15  28.14  28.13  28.12   
## 2    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
## 3  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40   
## 4  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82   
## 
##            ...            2008  2009  2010  2011  2012  2013  2014  2015  \
## 0          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
## 1          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
## 2          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
## 3          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
## 4          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
## 
##    2016        Life expectancy  
## 0   NaN               Abkhazia  
## 1   NaN            Afghanistan  
## 2   NaN  Akrotiri and Dhekelia  
## 3   NaN                Albania  
## 4   NaN                Algeria  
## 
## [5 rows x 218 columns]
##                  country  year  life_expectancy
## 0               Abkhazia  1800              NaN
## 1            Afghanistan  1800            28.21
## 2  Akrotiri and Dhekelia  1800              NaN
## 3                Albania  1800            35.40
## 4                Algeria  1800            28.82
## 49           Congo, Dem. Rep.
## 50                Congo, Rep.
## 53              Cote d'Ivoire
## 73     Falkland Is (Malvinas)
## 93              Guinea-Bissau
## 98           Hong Kong, China
## 118     United Korea (former)
## 131              Macao, China
## 132            Macedonia, FYR
## 145     Micronesia, Fed. Sts.
## 161           Ngorno-Karabakh
## 187            St. Barthélemy
## 193    St.-Pierre-et-Miquelon
## 225               Timor-Leste
## 251     Virgin Islands (U.S.)
## 252      North Yemen (former)
## 253      South Yemen (former)
## 258                     Åland
## Name: country, dtype: object
## (169260, 3)
## (43857, 3)
## year
## 1800    31.486020
## 1801    31.448905
## 1802    31.463483
## 1803    31.377413
## 1804    31.446318
## Name: life_expectancy, dtype: float64
## year
## 2012    71.663077
## 2013    71.916106
## 2014    72.088125
## 2015    72.321010
## 2016    72.556635
## Name: life_expectancy, dtype: float64

Gapminder Life Expectancy by Country (1899 vs 1800):

Gapminder Life Expectancy:

Python Data Manipulation

pandas Foundations

Chapter 1 - Data Ingestion and Inspection

Review of pandas data frames - tabular data structure with labelled rows and columns:

  • Rows have an index - a tabulated list of labels
  • Can get the columns as a list (technically, pandas index) using myPD.columns
  • Can get the rows as a list (technically, pandas index) using myPD.index
  • Can filter by integer position using myPD.iloc[row, col] # a bare : selects all rows or all columns, :a runs from the start through position a-1, and a: runs from position a to the end
    • The .loc accessor will instead access by way of row and column labels
  • Can see the first few rows using myPD.head() and can see the last few rows using myPD.tail() # put a number inside () if you do not want the default of 5 [positions 0-4 for head]
  • Can get similar information to str() when using myPD.info()
  • Can use slicing with a step (start:stop:step) - for example, myPD.iloc[::3, -1] will access every third row of the last column
  • Each column of a data frame is a pandas “Series”, which has its own .head() method and inherits its name from the column label in the parent DataFrame
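
The accessors above can be compared on a small hypothetical frame with string row labels. The data is made up so that positional and label-based access are easy to tell apart.

```python
import pandas as pd

# Hypothetical frame with string row labels
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7],
                   "b": [10, 20, 30, 40, 50, 60, 70]},
                  index=list("pqrstuv"))

print(df.columns)                  # pandas Index of column labels
print(df.index)                    # pandas Index of row labels

first_two = df.iloc[:2, :]         # positions 0 and 1, all columns
by_label = df.loc["q":"s", "b"]    # label slices include both endpoints
every_third = df.iloc[::3, -1]     # every third row of the last column

col = df["a"]                      # a single column is a Series named after the column
print(len(df.head()), col.name)
```

Note the difference in endpoints: positional slices with .iloc stop before the end position, while label slices with .loc include the end label.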

Building DataFrames from scratch:

  • Can load from flat files or other external data sources, such as pd.read_csv()
  • Can create from dictionaries (associative arrays) - the keys become the column names while the values (lists) become the column contents
    • pd.DataFrame(myDict) will run the conversion, with row indices starting from 0 and running through n-1 created by default
  • Can create from zipped tuples of lists - assume that lists a, b, and c have already been created and are of the same length
    • list_labels = [“a”, “b”, “c”] ; list_data = [a, b, c] ; zip_list = list(zip(list_labels, list_data))
    • pd.DataFrame(dict(zip_list)) will then create the pandas DataFrame by way of the dictionary
  • New columns can be created on the fly (broadcasting), such as myPD[“newCol”] = 0 # will put 0 in every row of newCol
    • Broadcasting can also be done with the dictionary method, where a single value in a key-value pair will be broadcast to all rows of the DataFrame
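
A compact sketch of the zip-to-dictionary route and of both broadcasting forms follows. The lists and labels are hypothetical.

```python
import pandas as pd

# Hypothetical lists of equal length
names = ["x", "y", "z"]
scores = [1.5, 2.5, 3.5]

list_labels = ["name", "score"]
list_data = [names, scores]
zip_list = list(zip(list_labels, list_data))   # [("name", [...]), ("score", [...])]

df = pd.DataFrame(dict(zip_list))

# A scalar is broadcast to every row, both on assignment and in the dictionary form
df["flag"] = 0
df2 = pd.DataFrame({"group": "g1", "name": names})
print(df)
print(df2)
```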

Importing and exporting data - example using ISSN_D_tot.csv, sunspot data:

  • Can read in the CSV using pd.read_csv(“myCSV.csv”)
    • Applying the option header=None will work better for data where the first row does not contain the column labels
    • Can also provide the option names=[myList] to assign myList as the column names
    • Can also provide the na_values= option to mark missing values; for example, na_values=" -1" if every entry of a space followed by -1 is supposed to mean a missing value
    • Can also provide a dictionary by column names for the NA strings, such as {“sunspots”:[" -1"]} to indicate that the sunspots data column in the CSV uses " -1" for NA
    • Can also provide the option parse_dates=[myList] and the reader will do its best to take data in columns myList and amalgamate them to a date
  • Can keep only the desired columns of a pandas DataFrame by using df[myCols] where myCols is a list of columns desired to be kept
  • Can write the DataFrame to a CSV using df.to_csv() # Can make other flat files using sep=, for example tab-delimited would be sep=“\t”
  • Can write the DataFrame to Excel using df.to_excel()
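
The reader options above can be sketched without any external file by reading from an in-memory string. The three data rows are made up; they only mimic a headerless layout in which " -1" marks a missing value.

```python
import io
import pandas as pd

# Hypothetical file contents: no header row, and " -1" marks a missing value
raw = "1818,01,01, -1\n1818,01,02, 22\n1818,01,03, 37\n"

col_names = ["year", "month", "day", "sunspots"]
df = pd.read_csv(io.StringIO(raw), header=None, names=col_names,
                 na_values={"sunspots": [" -1"]})

# Keep only some columns, then write tab-delimited text to an in-memory buffer
subset = df[["year", "sunspots"]]
buffer = io.StringIO()
subset.to_csv(buffer, sep="\t", index=False)
print(buffer.getvalue())
```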

Plotting with pandas - can plot either the pandas Series or the underlying numpy array - plt.plot() followed by plt.show() works on either/both:

  • myPD[“myCol”].values will be the numpy array for column myCol
  • myPD[“myCol”] will be the pandas Series for column myCol
  • Alternately, the pandas Series has a .plot() method, so myPD[“myCol”].plot() rather than plt.plot(myPD[“myCol”]) can be used
    • Can also apply the .plot() method to the full pandas DataFrame, such as myPD.plot()
  • Can apply plt.yscale(“log”) to create a log-scale for the y-axis
  • Some additional options to .plot() include color=, style=, legend= # colors are “r”, “b” and the like while styles are " ." and " .-" and the like
  • Can save plots as various formats, inferred by the extension of the plt.savefig() call
    • PNG plt.savefig(“myFile.png”)
    • JPG plt.savefig(“myFile.jpg”)
    • PDF plt.savefig(“myFile.pdf”)

Example code includes:


myPath = "./PythonInputFiles/"

# NEED TO CREATE FRAME df - "Total Population" - [3034970564.0, 3684822701.0, 4436590356.0, 5282715991.0, 6115974486.0, 6924282937.0] indexed by "Year" [1960, 1970, 1980, 1990, 2000, 2010]
# Import numpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


df = pd.DataFrame( {"Total Population":[3034970564.0, 3684822701.0, 4436590356.0, 5282715991.0, 6115974486.0, 6924282937.0], "Year":[1960, 1970, 1980, 1990, 2000, 2010]} )
df.index = df["Year"]
del df["Year"]
world_population = df.copy()

# Create array of DataFrame values: np_vals
np_vals = df.values

# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)

# Create array of new DataFrame by passing df to np.log10(): df_log10
df_log10 = np.log10(df)

# Print original and new data containers
print(type(np_vals), type(np_vals_log10))
print(type(df), type(df_log10))


list_keys = ['Country', 'Total']
list_values = [['United States', 'Soviet Union', 'United Kingdom'], [1118, 473, 273]]

# Zip the 2 lists together into one list of (key,value) tuples: zipped
zipped = list(zip(list_keys, list_values))

# Inspect the list using print()
print(zipped)

# Build a dictionary with the zipped list: data
data = dict(zipped)

# Build and inspect a DataFrame from the dictionary: df
df = pd.DataFrame(data)
print(df)


tempDict = {"a":[1980, 1981, 1982] , "b":["Blondie", "Chris Cross", "Joan Jett"] , "c":["Call Me", "Arthurs Theme", "I Love Rock and Roll"], "d":[6, 3, 7]}
df = pd.DataFrame(tempDict)

# Build a list of labels: list_labels
list_labels = ['year', 'artist', 'song', 'chart weeks']

# Assign the list of labels to the columns attribute: df.columns
df.columns = list_labels
print(df)


cities = ['Manheim', 'Preston park', 'Biglerville', 'Indiana', 'Curwensville', 'Crown', 'Harveys lake', 'Mineral springs', 'Cassville', 'Hannastown', 'Saltsburg', 'Tunkhannock', 'Pittsburgh', 'Lemasters', 'Great bend']

# Make a string with the value 'PA': state
state = "PA"

# Construct a dictionary: data
data = {'state':state, 'city':cities}

# Construct a DataFrame from dictionary data: df
df = pd.DataFrame(data)

# Print the DataFrame
print(df)


# "world_population.csv is the same 6x2 population data as per the above
# Read in the file: df1
# df1 = pd.read_csv("world_population.csv")
# Skipped this part

# Create a list of the new column labels: new_labels
# new_labels = ["year", "population"]

# Read in the file, specifying the header and names parameters: df2
# df2 = pd.read_csv('world_population.csv', header=0, names=new_labels)
# Skipped this step

# Print both the DataFrames
# print(df1)
# print(df2)


# DO NOT HAVE the messy data - file_messy is "messy_stock_data.tsv"
# Read the raw file as-is: df1
# df1 = pd.read_csv(file_messy)

# Print the output of df1.head()
# print(df1.head())

# Read in the file with the correct parameters: df2
# df2 = pd.read_csv(file_messy, delimiter="\t", header=3, comment="#")

# Print the output of df2.head()
# print(df2.head())

# Save the cleaned up DataFrame to a CSV file without the index
# df2.to_csv(file_clean, index=False)

# Save the cleaned up DataFrame to an excel file without the index
# df2.to_excel('file_clean.xlsx', index=False)



# DO NOT HAVE DataFrame df, which is a 744x1 of "Temperature (deg F)" indexed automatically as 0-743
# Downloaded raw METAR data for KAUS using 0801100000 UTC - 0831102359 UTC
# Coded to a cleaned CSV as per below
# 
# 
# metarList = []
# for line in open(myPath + "KAUS_Metar_Aug2010.txt", "r"): metarList.append(line.rstrip())
# cleanMetar = []
# cleanLine = ""
# for recs in metarList:
#     if recs.startswith("#") or recs == "" : continue
#     if recs.startswith("2") : 
#         if cleanLine != "" : 
#             cleanMetar.append(cleanLine)
#         cleanLine = recs
#     else:
#         cleanLine = cleanLine + " " + recs.strip()
# 
# cleanMetar.append(cleanLine)
# 
# useMetar = [textBlock for textBlock in cleanMetar if "METAR" in textBlock]
# useSpeci = [textBlock for textBlock in cleanMetar if "SPECI" in textBlock]
# assert len(cleanMetar) == len(useMetar) + len(useSpeci)
# 
# import re
# 
# metTime = []
# tempF = []
# dewF = []
# altMG = []
# 
# for textBlock in useMetar:
#     if textBlock.endswith("NIL="):
#         print("Not using line", textBlock)
#         continue
#     
#     # print(textBlock)
#     dateUTC = textBlock.split()[0]
#     
#     tempData = re.findall("T([0-9][0-9][0-9][0-9])([0-9][0-9][0-9][0-9])", textBlock)
#     assert len(tempData) == 1
#     a, b = tempData[0]
#     tempC = float(a[1:])/10
#     dewC = float(b[1:])/10
#     if a[0] == "1" : tempC = -tempC
#     if b[0] == "1" : dewC = -dewC
#     
#     tF = round((9/5) * tempC + 32, 0)
#     dF = round((9/5) * dewC + 32, 0)
#     
#     altData = re.findall("A([0-9][0-9][0-9][0-9])", textBlock)
#     assert len(altData) == 1
#     
#     aMG = float(altData[0]) / 100
#     # print(dateUTC, tempC, dewC, altMG, tempF, dewF)
#     
#     metTime.append(dateUTC)
#     tempF.append(tF)
#     dewF.append(dF)
#     altMG.append(aMG)
# 
# metarKAUS = pd.DataFrame( {"DateTime (UTC)":metTime, "Temperature (deg F)":tempF , "Dew Point (deg F)":dewF, "Pressure (atm)":altMG} )
# metarKAUS.index = metarKAUS["DateTime (UTC)"]
# del metarKAUS["DateTime (UTC)"]
# 
# metarKAUS.to_csv(myPath + "KAUS_Metar_Aug2010_Clean.csv")


# Create or import the data
# import random
# df = pd.DataFrame( {"Temperature (deg F)":np.random.randint(low=60, high=100, size=744)} )
dfFull = pd.read_csv(myPath + "KAUS_Metar_Aug2010_Clean.csv")
df = dfFull.loc[:, "Temperature (deg F)"]

# Create a plot with color='red'
df.plot(color="red")

# Add a title
plt.title('Temperature in Austin')

# Specify the x-axis label
plt.xlabel('Hours since midnight August 1, 2010')

# Specify the y-axis label
plt.ylabel('Temperature (degrees F)')

# Display the plot
# plt.show()
plt.savefig("_dummyPy050.png", bbox_inches="tight")
plt.clf()


# DO NOT HAVE DataFrame df, which is a 744x3 of "Temperature (deg F)", "Dew Point (deg F)", "Pressure (atm)" indexed automatically as 0-743
# df["Dew Point (deg F)"] = df.iloc[:, 0] + np.random.randint(low=-30, high=0, size=744)
# df["Pressure (atm)"] = np.random.randint(low=980, high=1020, size=744)
# Use dfFull rather than manufacturing data

df = dfFull.copy()
df.index = [x[6:8] + "-" + "{0:0>2}".format(str(int(x[9:10]) + 1)) + "Z" for x in df["DateTime (UTC)"].astype(str)]
del df["DateTime (UTC)"]

# Plot all columns (default)
df.plot()
# plt.show()
plt.savefig("_dummyPy051.png", bbox_inches="tight")
plt.clf()


# Plot all columns as subplots
df.plot(subplots=True)
# plt.show()
plt.savefig("_dummyPy052.png", bbox_inches="tight")
plt.clf()


# Plot just the Dew Point data
column_list1 = ['Dew Point (deg F)']
df[column_list1].plot()
# plt.show()
plt.savefig("_dummyPy053.png", bbox_inches="tight")
plt.clf()


# Plot the Dew Point and Temperature data, but not the Pressure data
column_list2 = ['Temperature (deg F)','Dew Point (deg F)']
df[column_list2].plot()
# plt.show()
plt.savefig("_dummyPy054.png", bbox_inches="tight")
plt.clf()
## <class 'numpy.ndarray'> <class 'numpy.ndarray'>
## <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>
## [('Country', ['United States', 'Soviet Union', 'United Kingdom']), ('Total', [1118, 473, 273])]
##           Country  Total
## 0   United States   1118
## 1    Soviet Union    473
## 2  United Kingdom    273
##    year       artist                  song  chart weeks
## 0  1980      Blondie               Call Me            6
## 1  1981  Chris Cross         Arthurs Theme            3
## 2  1982    Joan Jett  I Love Rock and Roll            7
##                city state
## 0           Manheim    PA
## 1      Preston park    PA
## 2       Biglerville    PA
## 3           Indiana    PA
## 4      Curwensville    PA
## 5             Crown    PA
## 6      Harveys lake    PA
## 7   Mineral springs    PA
## 8         Cassville    PA
## 9        Hannastown    PA
## 10        Saltsburg    PA
## 11      Tunkhannock    PA
## 12       Pittsburgh    PA
## 13        Lemasters    PA
## 14       Great bend    PA

Temperature - Austin, TX (Aug 2010):

METAR plots - Austin, TX (Aug 2010):

METAR Sub-plots - Austin, TX (Aug 2010):

Dew Point - Austin, TX (Aug 2010):

Temperature and Dew Point - Austin, TX (Aug 2010):


Chapter 2 - Exploratory Data Analysis

Visual exploratory data analysis - using Fisher’s iris flower data (similar to the R dataset):

  • Can use df.plot(x=“quotedVar1”, y=“quotedVar2”, kind=“scatter”) followed by plt.show() for general DataFrame plotting
    • The default is kind=“line”, though kind=“scatter” often makes more sense for unordered and/or multi-dimensional data
    • Can add plt.xlabel() and plt.ylabel() for labelling the axis dimensions
    • Can also have types like kind=“box” for box/whiskers, kind=“hist” for histograms, etc.
    • Further, can specify any matplotlib options inside DataFrame.plot() command - see the documentation
  • For histograms, cumulative=True will make the CDF rather than PDF while density=True (normed=True in older matplotlib versions) makes it probabilities rather than total counts
  • There are several manners (with slightly different defaults) for calling plots on a dataframe - df.plot(kind=“hist”), df.plot.hist(), and df.hist()

Statistical exploratory data analysis - starting with the .describe() method which is very similar to summary() in R - counts, means, quartiles, and the like:

  • These can be accessed individually, such as .count(), .mean(), .std(), .median(), .quantile(q) where q is between 0 and 1 and can be a list or array of values, .max(), .min()
    • All of these statistics AVOID the null entries - the count is the count of non-null, the mean is the mean of the non-null, etc.
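
A small sketch showing that the summary statistics skip missing entries. The series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value
s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

print(s.describe())                  # count, mean, std, min, quartiles, max
print(s.count(), s.mean(), s.median())

quartiles = s.quantile([0.25, 0.75]) # several quantiles at once
print(quartiles)
```

The count is 4 rather than 5, and the mean and median are computed from the four non-missing values only.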

Separating populations with boolean indexing - subsets of columns and/or rows for plotting, summarizing, and the like:

  • The .unique() method returns the unique factors of a categorical variable, suggesting subsets of interest for EDA
  • The typical filtering process would be to create a boolean Series, then myFilter = myDF.loc[myBool, :]
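
Boolean filtering can be sketched on a few made-up rows loosely modelled on the iris data. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a categorical column
df = pd.DataFrame({"species": ["setosa", "virginica", "setosa", "versicolor"],
                   "petal_length": [1.4, 5.5, 1.3, 4.7]})

print(df["species"].unique())

is_setosa = df["species"] == "setosa"   # boolean Series, one entry per row
setosa = df.loc[is_setosa, :]
others = df.loc[~is_setosa, :]          # ~ inverts the boolean Series
print(setosa["petal_length"].mean(), others["petal_length"].mean())
```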

Example code includes:


myPath = "./PythonInputFiles/"


import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt


dummyStock = pd.read_csv(myPath + "StockChart_20170615.csv", header=None)
dummyStock.columns = ["Symbol", "Data"]
# Data is a single space-delimited string of Date - Open - High - Low - Close - Volume

dummyStockSplit = dummyStock["Data"].str.split()
dummyDates = [datetime.strptime(x[0], "%m/%d/%Y") for x in dummyStockSplit]
dummyClose = [float(x[4]) for x in dummyStockSplit]

dfStock = pd.DataFrame( {"date":dummyDates, "symbol":dummyStock["Symbol"] , "close":dummyClose} )
df = dfStock.pivot(index="date", columns="symbol", values="close").resample("M").max()


# df is 12 x 4 with columns Month-AAPL-GOOG-IBM
# Create a list of y-axis column names: y_columns
y_columns = ["AAPL", "IBM"]

# Generate a line plot
df.plot(y=y_columns)

# Add the title
plt.title('Monthly stock prices')

# Add the y-axis label
plt.ylabel('Price ($US)')

# Display the plot
# plt.show()
plt.savefig("_dummyPy055.png", bbox_inches="tight")
plt.clf()


# Here, df appears to be the mtcars data
# Saved file from R
df = pd.read_csv(myPath + "mtcars.csv", index_col=0)

# sizes is a pre-defined np.array(), not sure of what
sizes = df["cyl"]
# Generate a scatter plot
df.plot(kind="scatter", x='hp', y='mpg', s=5*(sizes-3))

# Add the title
plt.title('Fuel efficiency vs Horse-power')

# Add the x-axis label
plt.xlabel('Horse-power')

# Add the y-axis label
plt.ylabel('Fuel efficiency (mpg)')

# Display the plot
# plt.show()
plt.savefig("_dummyPy056.png", bbox_inches="tight")
plt.clf()


# Make a list of the column names to be plotted: cols
cols = ["wt", "mpg"]

# Generate the box plots
df[cols].plot(kind="box", subplots=True)

# Display the plot
# plt.show()
plt.savefig("_dummyPy057.png", bbox_inches="tight")
plt.clf()


# Here, df is the tipping data from the Seaborn package, with emphasis on the column "fraction"
# Create a reasonable analog based on the pre-made CSV
tips = pd.read_csv(myPath + "tips.csv")
tips.sex = tips["sex"].astype("category")
tips.smoker = tips["smoker"].astype("category")
tips['total_bill'] = pd.to_numeric(tips["total_bill"], errors="coerce")
tips['tip'] = pd.to_numeric(tips["tip"], errors="coerce")
tips["fraction"] = tips["tip"] / tips["total_bill"]
df = tips.copy()


# This formats the plots such that they appear on separate rows
fig, axes = plt.subplots(nrows=2, ncols=1)

# Plot the PDF and CDF on the two axes
df.fraction.plot(ax=axes[0], kind='hist', bins=30, density=True, range=(0,.3))
df.fraction.plot(ax=axes[1], kind="hist", bins=30, density=True, cumulative=True, range=(0,.3))
# plt.show()
plt.savefig("_dummyPy058.png", bbox_inches="tight")
plt.clf()


# df is degrees by gender from http://nces.ed.gov/programs/digest/2013menu_tables.asp
# DO NOT HAVE DATASET - skip
# Print the minimum value of the Engineering column
# print(df["Engineering"].min())

# Print the maximum value of the Engineering column
# print(df["Engineering"].max())

# Construct the mean percentage per year: mean
# mean = df.mean(axis="columns")

# Plot the average percentage per year
# mean.plot()

# Display the plot
# plt.show()


# Now, df appears to be the Titanic dataset (not the table)
df = pd.read_csv(myPath + "titanic.csv")

# Print summary statistics of the fare column with .describe()
print(df["Fare"].describe())

# Generate a box plot of the fare column
df["Fare"].plot(kind="box")

# Show the plot
# plt.show()
plt.savefig("_dummyPy059.png", bbox_inches="tight")
plt.clf()


# Now, df is the life-expectancy Gapminder data as 260x219
# Needs the encoding to load
df = pd.read_csv(myPath + "gapminder.csv", encoding="latin-1", index_col=0).pivot_table(index="country", columns="year", values="life_expectancy")

# Print the number of countries reported in 2015
print(df[2015].count())

# Print the 5th and 95th percentiles
print(df.quantile([0.05, 0.95]))

# Generate a box plot
years = [1800, 1850, 1900, 1950, 2000]
df[years].plot(kind='box')
# plt.show()
plt.savefig("_dummyPy060.png", bbox_inches="tight")
plt.clf()


# Now, df is Pittsburgh weather data from https://www.wunderground.com/history/
# NEED TO GET THIS DATA
# january and march are both 31x2 with the columns being Date-Temperature
df = pd.read_csv(myPath + "KPIT_Temps_Small.csv")

january = df[["Date", "jan"]]
march = df[["Date", "mar"]]

# Print the mean of the January and March data
print(january.mean(), "\n", march.mean())

# Print the standard deviation of the January and March data
print(january.std(), "\n", march.std())


# Here, df is again automobile data of shape (392, 9)
# NEED TO GET THIS DATA - using MASS::Cars93 instead
tempDF = pd.read_csv(myPath + "Cars93.csv")
tempDF["Origin"]
df = tempDF[["Origin", "MPG.city", "MPG.highway", "Weight", "Horsepower"]]


# Compute the global mean and global standard deviation: global_mean, global_std
global_mean = df.mean()
global_std = df.std()

# Filter the US population from the origin column: us
us = df.loc[df["Origin"] == "USA", :]

# Compute the US mean and US standard deviation: us_mean, us_std
us_mean = us.mean()
us_std = us.std()

# Print the differences
print(us_mean - global_mean)
print(us_std - global_std)


# titanic is 1309x14 of data from the titanic
titanic = pd.read_csv(myPath + "titanic.csv", index_col=0)


# Display the box plots on 3 separate rows and 1 column
fig, axes = plt.subplots(nrows=3, ncols=1)

# Generate a box plot of the fare prices for the First passenger class
titanic.loc[titanic['Pclass'] == 1].plot(ax=axes[0], y='Fare', kind='box')

# Generate a box plot of the fare prices for the Second passenger class
titanic.loc[titanic['Pclass'] == 2].plot(ax=axes[1], y='Fare', kind='box')

# Generate a box plot of the fare prices for the Third passenger class
titanic.loc[titanic['Pclass'] == 3].plot(ax=axes[2], y='Fare', kind='box')

# Display the plot
# plt.show()
plt.savefig("_dummyPy061.png", bbox_inches="tight")
plt.clf()
## count    891.000000
## mean      32.204208
## std       49.693429
## min        0.000000
## 25%        7.910400
## 50%       14.454200
## 75%       31.000000
## max      512.329200
## Name: Fare, dtype: float64
## 208
## year   1800   1801   1802  1803  1804   1805   1806   1807  1808  1809  \
## 0.05  25.40  25.30  25.20  25.2  25.2  25.40  25.40  25.40  25.3  25.3   
## 0.95  37.92  37.35  38.37  38.0  38.3  38.37  38.37  38.37  38.0  38.0   
## 
## year   ...      2007   2008    2009    2010   2011    2012    2013   2014  \
## 0.05   ...     53.07  53.60  54.235  54.935  55.97  56.335  56.705  56.87   
## 0.95   ...     80.73  80.93  81.200  81.365  81.60  81.665  81.830  82.00   
## 
## year    2015     2016  
## 0.05  57.855  59.2555  
## 0.95  82.100  82.1650  
## 
## [2 rows x 217 columns]
## Date    16.000000
## jan     26.096774
## dtype: float64 
##  Date    16.000000
## mar     43.612903
## dtype: float64
## Date     9.092121
## jan     10.514608
## dtype: float64 
##  Date    9.092121
## mar     8.503636
## dtype: float64
## MPG.city        -1.407258
## MPG.highway     -0.940188
## Weight         122.409274
## Horsepower       3.692876
## dtype: float64
## MPG.city       -1.625356
## MPG.highway    -1.180389
## Weight        -24.668815
## Horsepower      2.080330
## dtype: float64

Maximum Stock Price by Month:

MPG vs HP (sized by Cylinders):

Box Plots for Weight and MPG (mtcars):

PDF and CDF for Tip as Percentage of Total Bill:

Box Plots for Titanic Fares:

Box Plot for Life Expectancy by Country (Gapminder):

Titanic Fares by Class (First, Second, Third):


Chapter 3 - Time series in pandas

Indexing pandas time series - dates and times are stored in datetime objects:

  • When reading from a CSV, passing parse_dates=True to pd.read_csv() will parse the index as datetimes, which display in ISO-8601 format (yyyy-mm-dd hh:mm:ss)
  • The index_col=“myDateFieldFromCSV” option in a pd.read_csv() will set the relevant date column (assuming parse_dates=True) as the datetime index for the DataFrame
  • Assuming that a DataFrame is indexed by datetime, can pass a smaller string (e.g., 2012-2 rather than 2012-2-5 11:00:00) to df.loc[] and everything that matches all of the smaller string will be extracted
  • Pandas supports partial datetime string selection, and accepts many input formats
    • df.loc[“February 5, 2015”] or df.loc[“2015-Feb-5”] or df.loc[“2015”]
  • Can also slice a datetime string, such as df.loc[“2015-Jan”:“2015-Mar”] to get the entire Q1 2015 data
  • Can convert objects to datetime using pd.to_datetime()
  • Can reindex the data using df.reindex(myTime, method=)
    • The default method is to fill with np.nan, though can specify “ffill” or “bfill” to fill forwards or backwards
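
The selection and reindexing ideas above can be sketched on a small synthetic series. The data, variable names, and dates below are made up for illustration and are not part of the course material:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering Q1 2015 (90 days), with values 0..89
idx = pd.date_range("2015-01-01", "2015-03-31", freq="D")
ts = pd.Series(np.arange(len(idx)), index=idx)

# Partial-string selection: every row in February 2015
feb = ts.loc["2015-02"]

# Slicing with partial strings includes both endpoints (all of January and February)
jan_to_feb = ts.loc["2015-01":"2015-02"]

# Convert strings to datetimes; the last date is NOT in ts
new_times = pd.to_datetime(["2015-01-01", "2015-01-02", "2015-04-01"])

# Reindex: labels missing from ts become NaN by default
plain = ts.reindex(new_times)

# Reindex with forward fill: missing labels take the last earlier value
filled = ts.reindex(new_times, method="ffill")

print(len(feb), len(jan_to_feb))
print(plain)
print(filled)
```

Note that the forward fill only works here because the original index is sorted in time order.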

Resampling pandas time series - taking statistical measures over different time intervals:

  • Downsampling is the process of reducing datetime rows to slower frequency (e.g., hourly to daily)
    • df.resample(“D”).mean() will take the mean of the down-sampled data, with “D” meaning “daily”
    • df.resample(“W”).mean() will take the mean of weekly data
    • Can build longer chains where needed; for example, df.resample(“D”).sum().max() will be the maximum daily sum
    • “min” or “T” is minute; “H” is hourly; “D” is daily; “B” is business daily
    • “W” is week; “M” is month; “Q” is quarter; “A” is year (annual)
    • Can further use interval multiples; for example, “3M” would be 3-monthly (essentially, quarterly)
    • There are certain defaults for things like “#W”; for example, weekly data is aligned to report by Sundays
  • Upsampling is the process of increasing datetime rows to faster frequency (e.g., daily to hourly)
    • A common upsampling approach would use “ffill” or “bfill” such as df.resample(“4H”).ffill(), which fills the newly created rows from the neighboring observations (interpolation is an alternative)
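
A minimal sketch of downsampling and upsampling follows, using invented data. It sticks to the “D” and “W” aliases, since some of the other aliases listed above appear to have been renamed in recent pandas releases; the hourly index is built from a Timedelta for the same reason:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series covering two full days (48 observations, values 0..47)
hourly_index = pd.date_range("2010-01-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.arange(48, dtype=float), index=hourly_index)

# Downsample: hourly -> daily mean
daily_mean = hourly.resample("D").mean()

# Longer chain: the maximum of the daily sums
max_daily_sum = hourly.resample("D").sum().max()

# Downsample to weekly: by default the bin label falls on a Sunday
weekly_mean = hourly.resample("W").mean()

# Upsample: a made-up weekly series -> daily rows, forward filling the gaps
weekly = pd.Series([1.0, 2.0], index=pd.to_datetime(["2010-01-03", "2010-01-10"]))
daily_filled = weekly.resample("D").ffill()

print(daily_mean)
print(max_daily_sum)
print(weekly_mean.index[0].day_name())
print(daily_filled)
```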

Manipulating pandas time series - changing the data in one or more columns:

  • Can apply the string methods such as df[“myCol”].str.upper() - note that this is NOT a transformation in place but rather a new series
  • Can apply the string method .contains() to search for a partial string, such as df[“myCol”].str.contains(“ello”) - will return a boolean of the same length
  • Can access the .dt accessor (datetime accessor) and its features, such as df[“myCol”].dt.hour - will extract the hour
    • The .dt.tz_localize(“US/Central”) will label timezone-naive datetimes as US Central Time (the clock times themselves are unchanged)
    • The .dt.tz_convert(“US/Eastern”) will convert timezone-aware datetimes to US Eastern Time
  • Can also chain these, such as .dt.tz_localize(“US/Central”).dt.tz_convert(“US/Eastern”) - note that the second .dt is needed after the tz_localize, since that returned a new series and not a .dt
  • To run a linear interpolation, use df.resample(“Y”).first().interpolate(“linear”) - will do linear interpolation between the values that already exist
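
The string, datetime, and interpolation tools above can be sketched together on a tiny made-up DataFrame (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: a string column and a timezone-naive datetime column
df = pd.DataFrame({
    "greeting": ["hello", "yellow", "world"],
    "when": pd.to_datetime(["2015-07-01 08:30", "2015-07-01 12:00", "2015-07-01 23:15"]),
})

# String methods return a NEW Series; df["greeting"] itself is unchanged
upper = df["greeting"].str.upper()

# Boolean Series of the same length, True where the substring is found
has_ello = df["greeting"].str.contains("ello")

# Extract the hour through the .dt accessor
hours = df["when"].dt.hour

# Label the naive times as US/Central, then convert them to US/Eastern
central = df["when"].dt.tz_localize("US/Central")
eastern = central.dt.tz_convert("US/Eastern")

# Linear interpolation: two known values four days apart, upsampled to daily
sparse = pd.Series([0.0, 4.0], index=pd.to_datetime(["2015-01-01", "2015-01-05"]))
interpolated = sparse.resample("D").mean().interpolate("linear")

print(upper.tolist())
print(has_ello.tolist())
print(hours.tolist())
print(eastern)
print(interpolated)
```

Taking the daily mean leaves NaN on the days with no observation, and those are the gaps that .interpolate("linear") then fills.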

Visualizing pandas time series - additional plotting techniques such as line types, plot types, and sub-plots:

  • Using daily S&P 500 data from 2010-01-01 through 2015-12-31 - Open-High-Low-Close-Volume
  • When plotting, can run df.plot(title=) to set the title, and plt.ylabel() later to set the y-axis labels
  • By default, df.plot will use a blue line provided that df is a pandas DataFrame
  • Can pass “MATLAB-like style strings” to the .plot(style=) options for other than lines
    • The string has 3 characters; color (“k” is black), marker (“.” is dot), and line type (“-”, or hyphen, is solid) - so “k.-” means black, solid line with a dot marker
    • Colors - “b” for blue, “g” for green, “r” for red, “c” for cyan
    • Markers - “o” for circle, “*” for star, “s” for square, “+” for plus
    • Line - “:” for dotted, “--” for dashed
  • Can also pass an argument to the .plot(kind=) to instead have “hist” or “area”
  • Can also pass the argument .plot(subplots=True) to have sub-plots created (on separate scales) for each of the data series
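
As a rough illustration of the style strings and sub-plots described above, the sketch below plots an invented price series. The data, column names, and title are assumptions, and the Agg backend is selected only so that the code runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical 30 days of closing prices and volumes
idx = pd.date_range("2015-01-01", periods=30, freq="D")
prices = pd.DataFrame(
    {"Close": np.linspace(100, 129, 30), "Volume": np.linspace(5, 1, 30)},
    index=idx,
)

# "k.-" = black color, dot marker, solid line
ax = prices["Close"].plot(style="k.-", title="Hypothetical closing price")
plt.ylabel("Price")

# Inspect what the style string produced
line = ax.get_lines()[0]
marker, linestyle = line.get_marker(), line.get_linestyle()
print(marker, linestyle)
plt.clf()

# One sub-plot per column, each on its own scale
axes = prices.plot(subplots=True)
n_axes = len(axes)
print(n_axes)
plt.close("all")
```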

Example code includes:


myPath = "./PythonInputFiles/"


import pandas as pd
import matplotlib.pyplot as plt


# GREAT data is available at https://mesonet.agron.iastate.edu/request/download.phtml?network=IL_ASOS
# Downloaded KORD data from 2010 to myPath + "KORD_2010_from_IAState.txt"
# First 5 rows are commented, the sixth row is the header, and the next 10,443 rows are the data

# Load the file
tmpORD = pd.read_csv(myPath + "KORD_2010_from_IAState.txt", header=5)
tmpORD.columns = tmpORD.columns.str.strip()
isMETAR = tmpORD.loc[:, "valid"].str.contains(":51")  # KORD METAR are taken at xx:51
useORD = tmpORD.loc[isMETAR, :]  # ends as 8709 x 22, probably the METAR check missed a few at "off" times

date_list = useORD["valid"]
temperature_list = list(useORD["tmpf"])

# This is 8,759 temperature observations reflecting 20100101 00:00 through 20101231 23:00 on an hourly basis
# Prepare a format string: time_format
time_format = '%Y-%m-%d %H:%M'

# Convert date_list into a datetime object: my_datetimes
my_datetimes = pd.to_datetime(date_list, format=time_format)  

# Construct a pandas Series using temperature_list and my_datetimes: ts0
# Something to explore later - this produced all np.nan if temperature_list were already a Series
ts0 = pd.Series(temperature_list, index=my_datetimes)

# Extract the 8:51pm observation on '2010-10-11' (the METAR nearest the 9pm hour): ts1
ts1 = ts0.loc['2010-10-11 20:51:00']

# Extract '2010-07-04' from ts0: ts2
ts2 = ts0.loc["2010-07-04"]

# Extract data from '2010-12-15' to '2010-12-31': ts3
ts3 = ts0.loc["2010-12-15":"2010-12-31"]


# Reindex without fill method: ts3
ts3 = ts2.reindex(ts0.index)

# Reindex with fill method, using forward fill: ts4
ts4 = ts2.reindex(ts0.index, method="ffill")

# Combine ts1 + ts2: sum12
sum12 = ts1 + ts2

# Combine ts1 + ts3: sum13
sum13 = ts1 + ts3

# Combine ts1 + ts4: sum14
sum14 = ts1 + ts4


# Still working with the temperature data, now renamed as df [technically, same index but containing Temperature-Dew Point-Pressure]
df = useORD[["tmpf", "dwpf", "alti"]]
df.index = my_datetimes
df.columns = ["Temperature", "DewPoint", "Pressure"]
saveWeather = df.copy()


# Downsample to 6 hour data and aggregate by mean: df1
df1 = df["Temperature"].resample("6H").mean()

# Downsample to daily data and count the number of data points: df2
df2 = df["Temperature"].resample("D").count()


# Extract temperature data for August: august
august = df.loc["2010-08", "Temperature"]

# Downsample to obtain only the daily highest temperatures in August: august_highs
august_highs = august.resample("D").max()

# Extract temperature data for February: february
february = df.loc["2010-02", "Temperature"]

# Downsample to obtain the daily lowest temperatures in February: february_lows
february_lows = february.resample("D").min()


# Extract data from 2010-Aug-01 to 2010-Aug-15: unsmoothed
unsmoothed = df['Temperature']["2010-08-01":"2010-08-15"]

# Apply a rolling mean with a 24 hour window: smoothed
smoothed = unsmoothed.rolling(window=24).mean()

# Create a new DataFrame with columns smoothed and unsmoothed: august
august = pd.DataFrame({'smoothed':smoothed, 'unsmoothed':unsmoothed})

# Plot both smoothed and unsmoothed data using august.plot().
august.plot()
# plt.show()
plt.savefig("_dummyPy062.png", bbox_inches="tight")
plt.clf()


# Extract the August 2010 data: august
august = df['Temperature']["2010-08"]

# Resample to daily data, aggregating by max: daily_highs
daily_highs = august.resample("D").max()

# Use a rolling 7-day window with method chaining to smooth the daily high temperatures in August
daily_highs_smoothed = daily_highs.rolling(window=7).mean()
print(daily_highs_smoothed)



# Plot the summer data
df = saveWeather.copy()
df.Temperature["2010-Jun":"2010-Aug"].plot()
# plt.show()
plt.savefig("_dummyPy063.png", bbox_inches="tight")
plt.clf()

# Plot the one week data
df.Temperature['2010-06-10':'2010-06-17'].plot()
# plt.show()
plt.savefig("_dummyPy064.png", bbox_inches="tight")
plt.clf()



# Now, df is 1741x17 of airline/airport data
# Saved the June 2011 data from hflights::hflights to csv
dfJun = pd.read_csv(myPath + "junFlights.csv")
dfJun["useMonth"] = ["{0:0>2}".format(x) for x in dfJun["Month"]]
dfJun["useDate"] = ["{0:0>2}".format(x) for x in dfJun["DayofMonth"]]
keyDates = dfJun["Year"].astype(str) + dfJun["useMonth"] + dfJun["useDate"]
time_format = '%Y%m%d'
useDates = pd.to_datetime(keyDates, format=time_format)  
dfJun.index = useDates

df = dfJun[["DayOfWeek", "Dest", "DepTime", "ArrTime", "UniqueCarrier", "FlightNum"]]
df.columns = ["Weekday", "Destination Airport", "Wheels-off Time", "Arrival Time", "Carrier", "Flight"]

# Strip extra whitespace from the column names: df.columns
df.columns = df.columns.str.strip()

# Extract data for which the destination airport is Dallas: dallas
dallas = df['Destination Airport'].str.contains("DAL")

# Compute the total number of Dallas departures each day: daily_departures
daily_departures = dallas.resample("D").sum()

# Generate the summary statistics for daily Dallas departures: stats
stats = daily_departures.describe()
print(stats)


# Reset the index of ts2 to ts1, and then use linear interpolation to fill in the NaNs: ts2_interp
# ts2_interp = ts2.reindex(ts1.index).interpolate("linear")

# Compute the absolute difference of ts1 and ts2_interp: differences 
# differences = np.abs(ts2_interp - ts1)

# Generate and print summary statistics of the differences
# print(differences.describe())


# Build a Boolean mask to select all the flights with 'LAX' as the destination: mask
import numpy as np
mask = df['Destination Airport'] == "LAX"

# Use the mask to subset the data: la
la = df[mask].dropna()
la["Date"] = la.index.astype(str)
la["Wheel Time"] = ["{0:0>4}".format(int(x)) for x in la["Wheels-off Time"]]

# Combine two columns of data to create a datetime series: times_tz_none 
times_tz_none = pd.to_datetime(la["Date"] + " " + la["Wheel Time"])

# Localize the time to US/Central: times_tz_central
times_tz_central = times_tz_none.dt.tz_localize("US/Central")

# Convert the datetimes from US/Central to US/Pacific
times_tz_pacific = times_tz_central.dt.tz_convert("US/Pacific")


newDF = pd.DataFrame( {"Date":keyDates, "Carrier":list(df["Carrier"]), "nFlight":1} )
useCarrier = [x in ["XE", "CO", "WN", "OO"] for x in newDF["Carrier"]]
useDF = newDF.loc[useCarrier].pivot_table(index="Date", columns=["Carrier"], values=["nFlight"], aggfunc=sum)

# Plot the raw data before setting the datetime index
useDF.plot()
# plt.show()
plt.savefig("_dummyPy065.png", bbox_inches="tight")
plt.clf()


# Convert the 'Date' column into a collection of datetime objects: df.Date
useDF["Date"] = pd.to_datetime(useDF.index)

# Set the index to be the converted 'Date' column
useDF.set_index("Date", inplace=True)  # inplace=True makes the conversion in place; no need to reassign

# Re-plot the DataFrame to see that the axis is now datetime aware!
useDF.plot()
# plt.show()
plt.savefig("_dummyPy066.png", bbox_inches="tight")
plt.clf()
## valid
## 2010-08-01          NaN
## 2010-08-02          NaN
## 2010-08-03          NaN
## 2010-08-04          NaN
## 2010-08-05          NaN
## 2010-08-06          NaN
## 2010-08-07    83.094286
## 2010-08-08    83.402857
## 2010-08-09    84.122857
## 2010-08-10    84.560000
## 2010-08-11    85.434286
## 2010-08-12    86.591429
## 2010-08-13    88.160000
## 2010-08-14    88.880000
## 2010-08-15    88.288571
## 2010-08-16    87.157143
## 2010-08-17    85.588571
## 2010-08-18    84.585714
## 2010-08-19    84.020000
## 2010-08-20    84.020000
## 2010-08-21    83.711429
## 2010-08-22    83.428571
## 2010-08-23    83.145714
## 2010-08-24    83.865714
## 2010-08-25    83.300000
## 2010-08-26    82.014286
## 2010-08-27    81.165714
## 2010-08-28    81.602857
## 2010-08-29    83.454286
## 2010-08-30    84.868571
## 2010-08-31    86.437143
## Freq: D, Name: Temperature, dtype: float64
## count    30.00000
## mean     26.30000
## std       4.05267
## min      17.00000
## 25%      25.75000
## 50%      28.00000
## 75%      28.00000
## max      30.00000
## Name: Destination Airport, dtype: float64

Chicago Temperatures (KORD) - August 2010:

Chicago Temperatures (KORD) - Summer 2010:

Chicago Temperatures (KORD) - June 10-17, 2010:

Flights per Day (Top 4 Carriers) - Houston, June 2011:

Index Formatted as Date-Time rather than String:


Chapter 4 - Case Study - Sunlight in Austin

Reading and cleaning the data - messy weather and climate data for Austin:

  • First dataset will be climate normals for Austin from 1981-2010 (NOAA, hourly averages)
  • Second dataset will be climate measurements for Austin from 2011 - needs cleaning

Statistical exploratory data analysis - slicing time series and the like:

  • .describe() is like the summary() call in R
  • .mean(), .count(), .median() and the like are all available individually
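
A small sketch of these summary calls on made-up temperature readings (the values and the Series name are purely illustrative):

```python
import pandas as pd

# Hypothetical temperature readings
temps = pd.Series([70.0, 72.0, 75.0, 71.0, 90.0], name="dry_bulb_faren")

# .describe() bundles count, mean, std, min, quartiles, and max
summary = temps.describe()
print(summary)

# The individual statistics are also available on their own
print(temps.mean(), temps.median(), temps.count())
```

The “50%” row of .describe() is the median, so it matches the .median() call.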

Visual exploratory data analysis - histograms, line plots, box plots, and the like:

  • Pandas builds on matplotlib, allowing for further customization to make the plots pretty

Example code includes:


myPath = "./PythonInputFiles/"


# Import pandas
import pandas as pd

# GREAT data is available at https://mesonet.agron.iastate.edu/request/download.phtml?network=TX_ASOS
# Downloaded KAUS data from 2011 to myPath + "KAUS_2011_from_IAState.txt"
tmpAUS = pd.read_csv(myPath + "KAUS_2011_from_IAState.txt", header=5)
tmpAUS.columns = tmpAUS.columns.str.strip()
isMETAR = tmpAUS.loc[:, "valid"].str.contains(":53")  # KAUS METAR are taken at xx:53
useAUS = tmpAUS.loc[isMETAR, :]  # ends as 11,352 x 22, tons of duplicate METAR
useAUS = useAUS.drop_duplicates(subset=["valid"])  # ends as 8,432 x 22, some days with as few as 15 records


# First 5 rows are commented and the sixth row is the header (hence header=5 in the read above)
# Read in the data file: df
# df = pd.read_csv("data.csv")
df = useAUS.copy()

df["date"] = [x.split()[0] for x in df["valid"]]
df["time"] = [x.split()[1] for x in df["valid"]]
df["StationType"] = "Airport"
df["sky_condition"] = df["skyc1"] + df["skyc2"] + df["skyc3"] + df["skyc4"]

# Print the output of df.head()
print(df.head())


# This is the column_labels list (my data is different - modify)
# column_labels = "Wban,date,Time,StationType,sky_condition,sky_conditionFlag,visibility,visibilityFlag,wx_and_obst_to_vision,wx_and_obst_to_visionFlag,dry_bulb_faren,dry_bulb_farenFlag,dry_bulb_cel,dry_bulb_celFlag,wet_bulb_faren,wet_bulb_farenFlag,wet_bulb_cel,wet_bulb_celFlag,dew_point_faren,dew_point_farenFlag,dew_point_cel,dew_point_celFlag,relative_humidity,relative_humidityFlag,wind_speed,wind_speedFlag,wind_direction,wind_directionFlag,value_for_wind_character,value_for_wind_characterFlag,station_pressure,station_pressureFlag,pressure_tendency,pressure_tendencyFlag,presschange,presschangeFlag,sea_level_pressure,sea_level_pressureFlag,record_type,hourly_precip,hourly_precipFlag,altimeter,altimeterFlag,junk"

# list_to_drop = ['sky_conditionFlag', 'visibilityFlag', 'wx_and_obst_to_vision', 'wx_and_obst_to_visionFlag', 'dry_bulb_farenFlag', 'dry_bulb_celFlag', 'wet_bulb_farenFlag', 'wet_bulb_celFlag', 'dew_point_farenFlag', 'dew_point_celFlag', 'relative_humidityFlag', 'wind_speedFlag', 'wind_directionFlag', 'value_for_wind_character', 'value_for_wind_characterFlag', 'station_pressureFlag', 'pressure_tendencyFlag', 'pressure_tendency', 'presschange', 'presschangeFlag', 'sea_level_pressureFlag', 'hourly_precip', 'hourly_precipFlag', 'altimeter', 'record_type', 'altimeterFlag', 'junk']

# Desired variables to be kept
# final_keep = ["Wban", "StationType", "date", "Time", "dry_bulb_faren", "dew_point_faren", "wet_bulb_faren", "dry_bulb_cel", "dew_point_cel", "wet_bulb_cel", "sky_condition", "station_pressure", "sea_level_pressure", "relative humidity", "wind_direction", "wind_speed", "visibility"]

final_keep = ["Wban", "StationType", "date", "Time", "dry_bulb_faren", "dew_point_faren", "sky_condition", "station_pressure", "sea_level_pressure", "relative humidity", "wind_direction", "wind_speed", "visibility"]

# Remove the appropriate columns: df_dropped
# df_dropped = df.drop(list_to_drop, axis="columns")
df_dropped = df.iloc[:, [0, 24, 22, 23, 2, 3, 25, 8, 9, 4, 5, 6, 10]]
df_dropped.columns = final_keep


# Print the output of df_dropped.head()
print(df_dropped.head())
print(df_dropped.shape)


# Convert the date column to string: df_dropped['date']
# df_dropped['date'] = df_dropped["date"].astype(str)

# Pad leading zeros to the Time column: df_dropped['Time']
# df_dropped['Time'] = df_dropped['Time'].apply(lambda x:'{:0>4}'.format(x))

# Concatenate the new date and Time columns: date_string
date_string = df_dropped['date'] + " " + df_dropped['Time']

# Convert the date_string Series to datetime: date_times
date_times = pd.to_datetime(date_string, format='%Y-%m-%d %H:%M')

# Set the index to be the new date_times container: df_clean
df_clean = df_dropped.set_index(date_times)


# Eliminate straggler record with index in 2010
is2011 = df_clean.index.year == 2011
df_clean = df_clean.loc[is2011, :]

# Print the output of df_clean.head()
print(df_clean.head())
print(df_clean.shape)


# Print the dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
print(df_clean.loc["2011-06-20 08:00:00":"2011-06-20 09:00:00", "dry_bulb_faren"])

# Convert the dry_bulb_faren column to numeric values: df_clean['dry_bulb_faren']
df_clean['dry_bulb_faren'] = pd.to_numeric(df_clean['dry_bulb_faren'], errors="coerce")

# Print the transformed dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
print(df_clean.loc["2011-06-20 08:00:00":"2011-06-20 09:00:00", "dry_bulb_faren"])

# Convert the wind_speed, dew_point_faren, and visibility columns to numeric values
df_clean['wind_speed'] = pd.to_numeric(df_clean['wind_speed'], errors="coerce")
df_clean['dew_point_faren'] = pd.to_numeric(df_clean['dew_point_faren'], errors="coerce")
df_clean['visibility'] = pd.to_numeric(df_clean['visibility'], errors="coerce")


# Print the median of the dry_bulb_faren column
print(df_clean["dry_bulb_faren"].median())

# Print the median of the dry_bulb_faren column for the time range '2011-Apr':'2011-Jun'
print(df_clean.loc["2011-04":"2011-06", 'dry_bulb_faren'].median())

# Print the median of the dry_bulb_faren column for the month of January
print(df_clean.loc["2011-01", 'dry_bulb_faren'].median())


# Downsample df_clean by day and aggregate by mean: daily_mean_2011
daily_mean_2011 = df_clean.resample("D").mean()

# Extract the dry_bulb_faren column from daily_mean_2011 using .values: daily_temp_2011
daily_temp_2011 = daily_mean_2011["dry_bulb_faren"].values


# NEED FILE!
# Downsample df_climate by day and aggregate by mean: daily_climate
# daily_climate = df_climate.resample("D").mean()

# Extract the Temperature column from daily_climate using .reset_index(): daily_temp_climate
# daily_temp_climate = daily_climate.reset_index()["Temperature"]

# Compute the difference between the two arrays and print the mean difference
# difference = daily_temp_2011 - daily_temp_climate
# print(difference.mean())


# Select days that are sunny: sunny
sunny = df_clean.loc[df_clean["sky_condition"].str.strip() == "CLR"]

# Select days that are overcast: overcast
overcast = df_clean.loc[df_clean["sky_condition"].str.contains("OVC")]

# Resample sunny and overcast, aggregating by maximum daily temperature
sunny_daily_max = sunny.resample("D").max()
overcast_daily_max = overcast.resample("D").max()

# Print the difference between the mean of sunny_daily_max and overcast_daily_max
print(sunny_daily_max.mean() - overcast_daily_max.mean())


# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Select the visibility and dry_bulb_faren columns and resample them: weekly_mean
weekly_mean = df_clean[["visibility", "dry_bulb_faren"]].resample("W").mean()

# Print the output of weekly_mean.corr()
print(weekly_mean.corr())

# Plot weekly_mean with subplots=True
weekly_mean.plot(subplots=True)
# plt.show()
plt.savefig("_dummyPy067.png", bbox_inches="tight")
plt.clf()


# Create a Boolean Series for sunny days: sunny
sunny = df_clean["sky_condition"].str.strip() == "CLR"

# Resample the Boolean Series by day and compute the sum: sunny_hours
sunny_hours = sunny.resample("D").sum()

# Resample the Boolean Series by day and compute the count: total_hours
total_hours = sunny.resample("D").count()

# Divide sunny_hours by total_hours: sunny_fraction
sunny_fraction = sunny_hours / total_hours

# Make a box plot of sunny_fraction
sunny_fraction.plot(kind="box")
# plt.show()
plt.savefig("_dummyPy068.png", bbox_inches="tight")
plt.clf()


# Resample dew_point_faren and dry_bulb_faren by Month, aggregating the maximum values: monthly_max
monthly_max = df_clean[['dew_point_faren', 'dry_bulb_faren']].resample("M").max()

# Generate a histogram with bins=8, alpha=0.5, subplots=True
monthly_max.plot(kind="hist", bins=8, alpha=0.5, subplots=True)

# Show the plot
# plt.show()
plt.savefig("_dummyPy069.png", bbox_inches="tight")
plt.clf()


# Recall that df_climate is a separate dataset of the 1981-2010 data
# NEED DATASET
# Extract the maximum temperature in August 2010 from df_climate: august_max
# august_max = df_climate.loc["2010-Aug", "Temperature"].max()
# print(august_max)

# Resample the August 2011 temperatures in df_clean by day and aggregate the maximum value: august_2011
# august_2011 = df_clean.loc["2011-Aug", "dry_bulb_faren"].resample("D").max()

# Filter out days in august_2011 where the value exceeded august_max: august_2011_high
# august_2011_high = august_2011.loc[august_2011 > august_max]

# Construct a CDF of august_2011_high
# august_2011_high.plot(kind="hist", bins=25, density=True, cumulative=True)  # density=True replaces the older normed=True

# Display the plot
# plt.show()
##   station             valid   tmpf   dwpf   relh    drct   sknt p01i   alti  \
## 0     AUS  2010-12-31 23:53  50.00  17.96  27.75  360.00  10.00    M  29.93   
## 1     AUS  2011-01-01 00:53  51.08  15.08  23.54  360.00  13.00    M  29.95   
## 2     AUS  2011-01-01 01:53  51.08  14.00  22.45  340.00   9.00    M  30.02   
## 3     AUS  2011-01-01 02:53  51.08  12.92  21.41   10.00  13.00    M  30.02   
## 4     AUS  2011-01-01 03:53  50.00  17.06  26.70  350.00   6.00    M  30.04   
## 
##       mslp      ...         skyl1 skyl2 skyl3 skyl4 presentwx  \
## 0  1013.20      ...       3900.00     M     M     M         M   
## 1  1014.20      ...       4500.00     M     M     M         M   
## 2  1016.20      ...       4900.00     M     M     M         M   
## 3  1016.20      ...       6000.00     M     M     M         M   
## 4  1017.00      ...       6500.00     M     M     M         M   
## 
##                                                metar        date   time  \
## 0  KAUS 010553Z 36010KT 10SM BKN039 10/M08 A2993 ...  2010-12-31  23:53   
## 1  KAUS 010653Z 36013KT 10SM OVC045 11/M09 A2995 ...  2011-01-01  00:53   
## 2  KAUS 010753Z 34009KT 10SM OVC049 11/M10 A3002 ...  2011-01-01  01:53   
## 3  KAUS 010853Z 01013KT 10SM OVC060 11/M11 A3002 ...  2011-01-01  02:53   
## 4  KAUS 010953Z 35006KT 10SM OVC065 10/M08 A3004 ...  2011-01-01  03:53   
## 
##   StationType sky_condition  
## 0     Airport  BKN           
## 1     Airport  OVC           
## 2     Airport  OVC           
## 3     Airport  OVC           
## 4     Airport  OVC           
## 
## [5 rows x 26 columns]
##   Wban StationType        date   Time dry_bulb_faren dew_point_faren  \
## 0  AUS     Airport  2010-12-31  23:53          50.00           17.96   
## 1  AUS     Airport  2011-01-01  00:53          51.08           15.08   
## 2  AUS     Airport  2011-01-01  01:53          51.08           14.00   
## 3  AUS     Airport  2011-01-01  02:53          51.08           12.92   
## 4  AUS     Airport  2011-01-01  03:53          50.00           17.06   
## 
##   sky_condition  station_pressure sea_level_pressure relative humidity  \
## 0  BKN                      29.93            1013.20             27.75   
## 1  OVC                      29.95            1014.20             23.54   
## 2  OVC                      30.02            1016.20             22.45   
## 3  OVC                      30.02            1016.20             21.41   
## 4  OVC                      30.04            1017.00             26.70   
## 
##   wind_direction wind_speed visibility  
## 0         360.00      10.00      10.00  
## 1         360.00      13.00      10.00  
## 2         340.00       9.00      10.00  
## 3          10.00      13.00      10.00  
## 4         350.00       6.00      10.00  
## (8432, 13)
##                     Wban StationType        date   Time dry_bulb_faren  \
## 2011-01-01 00:53:00  AUS     Airport  2011-01-01  00:53          51.08   
## 2011-01-01 01:53:00  AUS     Airport  2011-01-01  01:53          51.08   
## 2011-01-01 02:53:00  AUS     Airport  2011-01-01  02:53          51.08   
## 2011-01-01 03:53:00  AUS     Airport  2011-01-01  03:53          50.00   
## 2011-01-01 04:53:00  AUS     Airport  2011-01-01  04:53          50.00   
## 
##                     dew_point_faren sky_condition  station_pressure  \
## 2011-01-01 00:53:00           15.08  OVC                      29.95   
## 2011-01-01 01:53:00           14.00  OVC                      30.02   
## 2011-01-01 02:53:00           12.92  OVC                      30.02   
## 2011-01-01 03:53:00           17.06  OVC                      30.04   
## 2011-01-01 04:53:00           15.08  BKN                      30.04   
## 
##                     sea_level_pressure relative humidity wind_direction  \
## 2011-01-01 00:53:00            1014.20             23.54         360.00   
## 2011-01-01 01:53:00            1016.20             22.45         340.00   
## 2011-01-01 02:53:00            1016.20             21.41          10.00   
## 2011-01-01 03:53:00            1017.00             26.70         350.00   
## 2011-01-01 04:53:00            1017.20             24.50          20.00   
## 
##                     wind_speed visibility  
## 2011-01-01 00:53:00      13.00      10.00  
## 2011-01-01 01:53:00       9.00      10.00  
## 2011-01-01 02:53:00      13.00      10.00  
## 2011-01-01 03:53:00       6.00      10.00  
## 2011-01-01 04:53:00      10.00      10.00  
## (8431, 13)
## 2011-06-20 08:53:00    80.06
## Name: dry_bulb_faren, dtype: object
## 2011-06-20 08:53:00    80.06
## Name: dry_bulb_faren, dtype: float64
## 73.04
## 78.8
## 46.94
## dry_bulb_faren      6.827911
## dew_point_faren    -3.915446
## station_pressure   -0.002711
## wind_speed         -2.321292
## visibility          0.174696
## dtype: float64
##                 visibility  dry_bulb_faren
## visibility        1.000000        0.456775
## dry_bulb_faren    0.456775        1.000000

Mean Visibility and Temperature - Austin, TX 2011:

Percentage of Time with Clear Skies (CLR/SKC) by Day - Austin, TX 2011:

Histogram for Maximum Monthly Temperature and Dew Point - Austin, TX 2011:

Manipulating DataFrames with pandas

Chapter 1 - Extracting and transforming data

Indexing DataFrames - multiple ways to extract data from the pandas DataFrame:

  • Bracketing methodology - myDF[“myCol”][“myRow”] where myCol is the column name and myRow is the row index name
  • Column attribute methodology - myDF.myCol[“myRow”] where myCol is the column name IFF it is also a valid Python name
  • Accessors such as .loc and .iloc are much more programmatically reproducible ways to get access to the data
    • The .loc accesses using labels
    • The .iloc accesses using index positions
  • Using labels - myDF.loc[“myRow”, “myCol”]
  • Using indices - myDF.iloc[myRowIdx, myColIdx]
  • To ensure getting back a pandas DataFrame, use a nested list - for example, myDF[[‘myColB’, ‘myColA’]] will return just myColA and myColB, with the result as a pandas DataFrame with myColB as the first column
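
The four access styles above can be compared on a small made-up DataFrame; the column names, row labels, and values are hypothetical:

```python
import pandas as pd

# Hypothetical sales data indexed by month label
df = pd.DataFrame(
    {"eggs": [47, 110, 221], "salt": [12.0, 50.0, 89.0], "spam": [17, 31, 72]},
    index=["jan", "feb", "mar"],
)

# Four ways to reach the same cell (row "feb", column "salt")
by_brackets = df["salt"]["feb"]
by_attribute = df.salt["feb"]
by_label = df.loc["feb", "salt"]
by_position = df.iloc[1, 1]
print(by_brackets, by_attribute, by_label, by_position)

# A nested list returns a DataFrame with columns in the requested order
sub = df[["spam", "eggs"]]
print(type(sub), list(sub.columns))
```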

Slicing DataFrames - different return types that come from indexing a pandas DataFrame:

  • A simple extract such as df[“myCol”] will return as pandas.core.series.Series, basically a 1-dimensional array that is a hybrid between a numpy array and a dictionary
  • A sliced extract such as df[“myCol”][a:b] still returns a pandas Series; only a single-element extract such as df[“myCol”][a] returns the more basic type associated with myCol of the pandas DataFrame
    • Can use a:b:-1 to specify that the step size should be -1 rather than the default value of +1
  • Can also slice using names, and it INCLUDES both sides of the slice - df.loc[:, “myColA”:“myColB”] will extract all rows, as well as all columns FROM myColA TO myColB (inclusive)
    • Can similarly slice on the row index names, such as myDF.loc[“rowIndexA”:“rowIndexB”, :]
    • Can slice both rows and columns at the same time also
    • Can also slice using index numbers and .iloc[] (here the end of the slice is excluded, as with lists)
  • Can also slice using lists - either inside the .loc[] or inside the .iloc[]
  • In case there is a need to keep a pandas DataFrame, use df[[“myCol”]], as opposed to df[“myCol”] which will return a pandas Series
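
The return types and the inclusive/exclusive slice boundaries can be checked on a small frame. A minimal sketch, assuming a hypothetical 4 x 3 DataFrame with single-letter labels:

```python
import pandas as pd

# Hypothetical toy data with string row and column labels
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]},
    index=["w", "x", "y", "z"],
)

one_col = df["a"]       # a pandas Series
one_col_df = df[["a"]]  # a one-column pandas DataFrame

# Label slices include both endpoints; position slices exclude the end
label_slice = df.loc["x":"y", "a":"b"]
pos_slice = df.iloc[1:3, 0:2]

# A step of -1 walks the rows in reverse order
reversed_rows = df.iloc[::-1]

# Lists pick out specific rows/columns in the order given
list_pick = df.loc[["w", "z"], ["c", "a"]]

print(type(one_col), type(one_col_df))
print(label_slice.equals(pos_slice))
```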

Filtering DataFrames - general tool for selecting part of the data based on its properties rather than its indices (typically by way of Booleans):

  • The basic example would be myDF[myDF[“myCol”] > a], which will extract all the rows where myDF.myCol exceeds a
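  • Filters can be combined using the & (and), | (or), and ~ (not) operators, with each condition wrapped in parentheses; the Python keywords and/or/not do not work element-wise on a Series
  • Selecting the columns that contain exclusively non-zero values (note that NaN is not zero!) can be achieved using myDF.all() - so myDF.loc[:, myDF.all()]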
    • Alternately, can use myDF.any() to obtain every column that has 1+ non-zero values
    • Alternately, can use myDF.isnull() to identify the NaN, so myDF.isnull().any() will be the columns that have 1+ NaN
    • Similarly, can use myDF.notnull() to identify the non-NaN, so myDF.notnull().all() will be the columns that have 0 NaN
  • Can remove any rows with missing data using .dropna(), such as myDF.dropna(how=“any”) – note that how = “any” drops ROWS with any NaN while how = “all” drops ROWS with only NaN
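  • Can also run operations such as myDF[“myColA”][myDF[“myColB”] > x] += y to add y to myColA any time myColB exceeds x; this chained form can raise a SettingWithCopyWarning, and myDF.loc[myDF[“myColB”] > x, “myColA”] += y is the safer equivalent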
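
A minimal sketch of these filters, assuming a hypothetical three-row DataFrame with one zero and one NaN so that each method selects something different:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'x' holds a zero, 'y' holds a NaN
df = pd.DataFrame(
    {"x": [1, 0, 3], "y": [4.0, np.nan, 6.0], "z": [7, 8, 9]},
    index=["r1", "r2", "r3"],
)

big_x = df[df["x"] > 0]                        # rows where x is positive
combined = df[(df["x"] > 0) & ~(df["z"] > 8)]  # conditions joined with & and ~

all_nonzero = df.loc[:, df.all()]          # columns with no zeros (NaN is not zero)
has_nan = df.loc[:, df.isnull().any()]     # columns with 1+ NaN
no_nan = df.loc[:, df.notnull().all()]     # columns with 0 NaN
dropped = df.dropna(how="any")             # rows with no NaN at all

# Conditional update through .loc rather than chained indexing
df.loc[df["z"] > 7, "x"] += 10
print(df)
```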

Transforming DataFrames - best practice is to use built-in pandas methods, and otherwise by universal numpy methods:

  • For example, myDF.floordiv(a) will take every column, divide by a, and then return the floor
    • Could alternately run np.floor_divide(myDF, a)
    • Whether using the pandas method or the numpy function, the operation is vectorized (run element by element)
  • Can also run a custom function using myDF.apply(myFunc), which by default passes each column (as a Series) to myFunc; a function built from vectorized operations therefore still works element by element
    • Can also use lambda functions, such as myDF.apply(lambda x: x // a)
  • The default for all of these operations is to create a new pandas DataFrame, so the existing DataFrame is not touched; can assign the result as needed
  • Can access the indices for the DataFrame using myDF.index (this will be a pandas Index object, holding strings when the row labels are strings)
    • By using the .str accessor, you can access all of the string operations - myDF.index.str.upper() will take all the index strings and convert them to upper case
  • The index cannot use .apply() and instead uses .map() - myDF.index.map(str.lower) will convert all the index values to lower
  • Can consider .map() to be applying a dictionary to any specific piece of information
    • As a result, the .map() can only be applied to a Series and not to a DataFrame
  • Can use arithmetic operations directly on the columns - myDF[“myColA”] + myDF[“myColB”] will add the columns together
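
The equivalence of the pandas method, the numpy function, and .apply() can be sketched as follows. The DataFrame, the divisor 12, and the mapping dictionary are hypothetical values chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"a": [10, 25], "b": [37, 48]}, index=["north", "south"])

dozens_pd = df.floordiv(12)                      # pandas method
dozens_np = np.floor_divide(df, 12)              # numpy universal function
dozens_apply = df.apply(lambda col: col // 12)   # custom function, one column at a time

upper_idx = df.index.str.upper()       # string methods on the index
lower_idx = upper_idx.map(str.lower)   # .map() applies a function to each label

# .map() on a Series can also take a dictionary as a lookup table
codes = pd.Series(["north", "south"]).map({"north": "N", "south": "S"})

total = df["a"] + df["b"]              # column arithmetic, aligned on the index
print(dozens_pd)
print(total)
```

None of these calls modifies df itself; each returns a new object that can be assigned as needed.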

Example code includes:


myPath = "./PythonInputFiles/"
import pandas as pd


# NEED DATA FRAME election (67 x 8) - indexed by county with columns state (PA) - total - Obama - Romney - winner - voters - turnout - margin
# appears to be 2012 US general election data, with the Obama and Romney columns being percentages, total being total votes, and voters being registered voters
# Saved the DataCamp file to myPath + "PAElection_2012.csv"

electionPA = pd.read_csv(myPath + "PAElection_2012.csv", index_col="county")
election = electionPA.copy()


# Assign the row position of election.loc['Bedford']: x
x = 4

# Assign the column position of election['winner']: y
y = 4

# Print the boolean equivalence
print(election.iloc[x, y] == election.loc['Bedford', 'winner'])


# DO NOT RUN - downloaded to myPath + "PAElection_2012.csv" instead
# filename = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1650/datasets/pennsylvania2012.csv'
# election = pd.read_csv(filename, index_col='county')

# Create a separate dataframe with the columns ['winner', 'total', 'voters']: results
results = election[['winner', 'total', 'voters']]

# Print the output of results.head()
print(results.head())


# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:, :"Obama"]

# Print the output of left_columns.head()
print(left_columns.head())

# Slice the columns from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:, "Obama":"winner"]

# Print the output of middle_columns.head()
print(middle_columns.head())

# Slice the columns from 'Romney' to the end: 'right_columns'
right_columns = election.loc[:, "Romney":]

# Print the output of right_columns.head()
print(right_columns.head())


# Create the list of row labels: rows
rows = ['Philadelphia', 'Centre', 'Fulton']

# Create the list of column labels: cols
cols = ['winner', 'Obama', 'Romney']

# Create the new DataFrame: three_counties
three_counties = election.loc[rows, cols]

# Print the three_counties DataFrame
print(three_counties)


# Create a turnout column (total votes as a percentage of registered voters)
election["turnout"] = 100 * election["total"] / election["voters"]

# Create the boolean array: high_turnout
high_turnout = election["turnout"] > 70

# Filter the election DataFrame with the high_turnout array: high_turnout_df
high_turnout_df = election[high_turnout]

# Print the high_turnout_df DataFrame
print(high_turnout_df)


# Import numpy
import numpy as np

# Create the election["margin"] column
election["margin"] = abs(election["Obama"] - election["Romney"])

# Create the boolean array: too_close
too_close = election["margin"] < 1

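# Assign np.nan to the 'winner' column where the results were too close to call
# Note: this chained assignment triggers the SettingWithCopyWarning shown in the output;
# election.loc[too_close, "winner"] = np.nan is the recommended form
election["winner"][too_close] = np.nan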

# Print the output of election.info()
print(election.info())


# NEED DATASET titanic (1309 x 14)
# User version saved previously
titanic = pd.read_csv(myPath + 'titanic.csv', index_col=0)


# Select the 'age' and 'cabin' columns: df
df = titanic[["Age", "Cabin"]]

# Print the shape of df
print(df.shape)

# Drop rows in df with how='any' and print the shape
print(df.dropna(how="any").shape)

# Drop rows in df with how='all' and print the shape
print(df.dropna(how="all").shape)

# Call .dropna() with thresh=500 (the original exercise used 1000) and axis='columns' and print the output of .info() from titanic
print(titanic.dropna(thresh=500, axis='columns').info())


# NEED DATASET weather which is 365 x 23 from Weather Underground, representing Pittsburgh weather data for 2013
# https://www.wunderground.com/history
# Use the KORD METAR data instead
# Load the file
tmpORD = pd.read_csv(myPath + "KORD_2010_from_IAState.txt", header=5)
tmpORD.columns = tmpORD.columns.str.strip()
isMETAR = tmpORD.loc[:, "valid"].str.contains(":51")  # KORD METAR are taken at xx:51
useORD = tmpORD.loc[isMETAR, :]  # ends as 8709 x 22, probably the METAR check missed a few at "off" times

date_list = useORD["valid"]
time_format = '%Y-%m-%d %H:%M'
my_datetimes = pd.to_datetime(date_list, format=time_format)  
useORD.index = my_datetimes

# Just keep the temperature and dew point
weather = useORD[["tmpf", "dwpf"]]
weather.columns = ['Mean TemperatureF','Mean Dew PointF']

# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
    return 5/9*(F - 32)

# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': df_celsius
df_celsius = weather[['Mean TemperatureF','Mean Dew PointF']].apply(to_celsius)

# Reassign the columns df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']

# Print the output of df_celsius.head()
print(df_celsius.head())


# Create the dictionary: red_vs_blue
red_vs_blue = {"Obama":"blue", "Romney":"red"}

# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election["winner"].map(red_vs_blue)

# Print the output of election.head()
print(election.head())


# Import zscore from scipy.stats
# Need to solve BLAS/LAPACK issue - cannot get scipy to download and install . . . 
# from scipy.stats import zscore

import numpy as np
def zscore(x):
    mu = np.mean(x)
    sd = np.std(x)
    return((x - mu) / sd)

# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election["turnout"])

# Print the type of turnout_zscore
print(type(turnout_zscore))

# Assign turnout_zscore to a new column: election['turnout_zscore']
election["turnout_zscore"] = turnout_zscore

# Print the output of election.head()
print(election.head())
## -c:90: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame
## 
## See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
## True
##            winner   total  voters
## county                           
## Adams      Romney   41973   61156
## Allegheny   Obama  614671  924351
## Armstrong  Romney   28322   42147
## Beaver     Romney   80015  115157
## Bedford    Romney   21444   32189
##           state   total      Obama
## county                            
## Adams        PA   41973  35.482334
## Allegheny    PA  614671  56.640219
## Armstrong    PA   28322  30.696985
## Beaver       PA   80015  46.032619
## Bedford      PA   21444  22.057452
##                Obama     Romney  winner
## county                                 
## Adams      35.482334  63.112001  Romney
## Allegheny  56.640219  42.185820   Obama
## Armstrong  30.696985  67.901278  Romney
## Beaver     46.032619  52.637630  Romney
## Bedford    22.057452  76.986570  Romney
##               Romney  winner  voters
## county                              
## Adams      63.112001  Romney   61156
## Allegheny  42.185820   Obama  924351
## Armstrong  67.901278  Romney   42147
## Beaver     52.637630  Romney  115157
## Bedford    76.986570  Romney   32189
##               winner      Obama     Romney
## county                                    
## Philadelphia   Obama  85.224251  14.051451
## Centre        Romney  48.948416  48.977486
## Fulton        Romney  21.096291  77.748861
##              state   total      Obama     Romney  winner  voters    turnout
## county                                                                     
## Bucks           PA  319407  49.966970  48.801686   Obama  435606  73.324748
## Butler          PA   88924  31.920516  66.816607  Romney  122762  72.436096
## Chester         PA  248295  49.228539  49.650617  Romney  337822  73.498766
## Forest          PA    2308  38.734835  59.835355  Romney    3232  71.410891
## Franklin        PA   62802  30.110506  68.583803  Romney   87406  71.850903
## Montgomery      PA  401787  56.637223  42.286834   Obama  551105  72.905708
## Westmoreland    PA  168709  37.567646  61.306154  Romney  238006  70.884347
## <class 'pandas.core.frame.DataFrame'>
## Index: 67 entries, Adams to York
## Data columns (total 8 columns):
## state      67 non-null object
## total      67 non-null int64
## Obama      67 non-null float64
## Romney     67 non-null float64
## winner     64 non-null object
## voters     67 non-null int64
## turnout    67 non-null float64
## margin     67 non-null float64
## dtypes: float64(4), int64(2), object(2)
## memory usage: 5.4+ KB
## None
## (891, 2)
## (185, 2)
## (733, 2)
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 891 entries, 1 to 891
## Data columns (total 11 columns):
## PassengerId    891 non-null int64
## Survived       891 non-null int64
## Pclass         891 non-null int64
## Name           891 non-null object
## Sex            891 non-null object
## Age            714 non-null float64
## SibSp          891 non-null int64
## Parch          891 non-null int64
## Ticket         891 non-null object
## Fare           891 non-null float64
## Embarked       889 non-null object
## dtypes: float64(2), int64(5), object(4)
## memory usage: 69.6+ KB
## None
##                      Mean TemperatureC  Mean Dew PointC
## valid                                                  
## 2010-01-01 00:51:00               -9.4            -16.1
## 2010-01-01 01:51:00              -10.0            -16.1
## 2010-01-01 02:51:00              -11.1            -16.1
## 2010-01-01 03:51:00              -11.7            -16.7
## 2010-01-01 04:51:00              -12.2            -16.7
##           state   total      Obama     Romney  winner  voters    turnout  \
## county                                                                     
## Adams        PA   41973  35.482334  63.112001  Romney   61156  68.632677   
## Allegheny    PA  614671  56.640219  42.185820   Obama  924351  66.497575   
## Armstrong    PA   28322  30.696985  67.901278  Romney   42147  67.198140   
## Beaver       PA   80015  46.032619  52.637630  Romney  115157  69.483401   
## Bedford      PA   21444  22.057452  76.986570  Romney   32189  66.619031   
## 
##               margin color  
## county                      
## Adams      27.629667   red  
## Allegheny  14.454399  blue  
## Armstrong  37.204293   red  
## Beaver      6.605012   red  
## Bedford    54.929118   red  
## <class 'pandas.core.series.Series'>
##           state   total      Obama     Romney  winner  voters    turnout  \
## county                                                                     
## Adams        PA   41973  35.482334  63.112001  Romney   61156  68.632677   
## Allegheny    PA  614671  56.640219  42.185820   Obama  924351  66.497575   
## Armstrong    PA   28322  30.696985  67.901278  Romney   42147  67.198140   
## Beaver       PA   80015  46.032619  52.637630  Romney  115157  69.483401   
## Bedford      PA   21444  22.057452  76.986570  Romney   32189  66.619031   
## 
##               margin color  turnout_zscore  
## county                                      
## Adams      27.629667   red        0.853734  
## Allegheny  14.454399  blue        0.439846  
## Armstrong  37.204293   red        0.575650  
## Beaver      6.605012   red        1.018647  
## Bedford    54.929118   red        0.463391

Chapter 2 - Advanced Indexing

Index objects and labeled data - one of the key building blocks of the pandas Data Structures:

  • There are several key building blocks for a pandas DataFrame
    • Indexes: Sequence of labels that must be immutable and homogeneous in data type
    • Series: 1D array with index
    • DataFrames: 2D array with index
  • Can create a pandas Series using pd.Series(myList, index=myIndex) where the default for index is integers starting at 0
    • The index can be sliced just like a list and always has the .name attribute (default at creation is None)
  • Sometimes, it is valuable to make one of the columns (Series) in the DataFrame into the overall index
    • myDF.index = myDF[“keyCol”] will make the index assignment
    • del myDF[“keyCol”] will remove keyCol from the data
  • Can also set indices inside pd.read_csv by using the index_col= options
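
A minimal sketch of these building blocks, using hypothetical price values and a hypothetical "key" column:

```python
import pandas as pd

# A Series is a 1D array plus an index
prices = pd.Series([10.70, 10.86, 10.74], index=["Mon", "Tue", "Wed"])
default_name = prices.index.name   # None until it is assigned
prices.index.name = "weekday"
first_two = prices.index[:2]       # the index slices like a list

# Index objects are immutable: assigning a single label raises a TypeError
try:
    prices.index[0] = "Monday"
    mutated = True
except TypeError:
    mutated = False

# Promote a column to the index, then drop the now-redundant column
df = pd.DataFrame({"key": ["a", "b"], "val": [1, 2]})
df.index = df["key"]
del df["key"]
print(df)
```

The whole index can still be replaced at once (as with df.index = ...); it is only in-place modification of individual labels that is blocked.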

Hierarchical indexing - representing multi-dimensional index data:

  • An example would be stock price data, which might be unique by Date-Symbol rather than just being unique by Date or Symbol
  • Can pass a list of column names to .set_index() to solve this - myDF.set_index([“Symbol”, “Date”])
    • This will have myDF.index.name = None and myDF.index.names = [“Symbol”, “Date”]
  • Can sort the MultiIndex using .sort_index(), which appears to sort by the first element (Symbol in this case), then the second element (Date in this case)
  • Can access from the MultiIndex using tuples, such as myDF.loc[(‘CSCO’, ‘2016-10-01’)] to get the row that contains the CSCO data from 2016-10-01
    • Can further use slicing on the outermost index, such as myDF.loc[‘CSCO’ : ‘MSFT’]
    • Can further index on both components, such as myDF.loc[ ([“AAPL”, “CSCO”], “2016-10-05”), : ]
  • When slicing on both index levels inside a tuple, the bare colon is not valid Python syntax
    • The built-in slice() can be used instead, with slice(None) meaning “everything”
    • myDF.loc[ (slice(None), slice(“2016-10-03”, “2016-10-05”)), : ] enforces that the inner index should be sliced as 2016-10-03 : 2016-10-05
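
The lookups above can be sketched on a small Symbol-Date table. The closing prices below are hypothetical values, not real market data:

```python
import pandas as pd

# Hypothetical stock data, unique by Symbol-Date
stocks = pd.DataFrame({
    "Symbol": ["CSCO", "AAPL", "MSFT", "CSCO", "AAPL", "MSFT"],
    "Date": ["2016-10-03"] * 3 + ["2016-10-04"] * 3,
    "Close": [31.5, 112.5, 57.4, 31.4, 113.0, 57.2],
})
# Sorting the MultiIndex is needed before slicing on it
stocks = stocks.set_index(["Symbol", "Date"]).sort_index()

one_value = stocks.loc[("CSCO", "2016-10-04"), "Close"]          # full tuple key
outer_slice = stocks.loc["AAPL":"CSCO"]                          # slice on the outer level
two_symbols = stocks.loc[(["AAPL", "MSFT"], "2016-10-03"), :]    # list on outer, label on inner
one_date = stocks.loc[(slice(None), "2016-10-04"), :]            # every symbol, one date
date_range = stocks.loc[(slice(None), slice("2016-10-03", "2016-10-04")), :]

print(stocks.index.names)
print(one_date)
```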

Example code includes:


myPath = "./PythonInputFiles/"

import pandas as pd
import numpy as np

sales = pd.DataFrame()
sales["eggs"] = [47, 110, 221, 77, 132, 205]
sales["salt"] = [12, 50, 89, 87, np.nan, 60]
sales["spam"] = [17, 31, 72, 20, 52, 55]
sales.index = ["jan", "feb", "mar", "apr", "may", "jun"]


# Create the list of new indexes: new_idx
new_idx = [x.upper() for x in sales.index]

# Assign new_idx to sales.index
sales.index = new_idx

# Print the sales DataFrame
print(sales)


# Assign the string 'MONTHS' to sales.index.name
sales.index.name = "MONTHS"

# Print the sales DataFrame
print(sales)

# Assign the string 'PRODUCTS' to sales.columns.name 
sales.columns.name = "PRODUCTS"

# Print the sales dataframe again
print(sales)


# Generate the list of months: months
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']

# Assign months to sales.index
sales.index = months

# Print the modified sales DataFrame
print(sales)


# NEED TO MODIFY sales so it is the same data but indexed as CA/1, CA/2, NY/1, NY/2, TX/1, TX/2 (using state-month)
sales = sales.set_index([["CA", "CA", "NY", "NY","TX", "TX"], [1, 2, 1, 2, 1, 2]])

# Print sales.loc[['CA', 'TX']]
print(sales.loc[['CA', 'TX']])

# Print sales['CA':'TX']
print(sales['CA':'TX'])


# Now, sales is again a non-indexed DataFrame with state-month as columns
# Set the index to be the columns ['state', 'month']: sales
states = [x for x, y in list(sales.index)]
months = [y for x, y in list(sales.index)]

sales.index = range(sales.shape[0])
sales["state"] = states
sales["month"] = months
oldSales = sales.copy()

sales = sales.set_index(['state', 'month'])

# Sort the MultiIndex: sales
sales = sales.sort_index(ascending=False)

# Print the sales DataFrame
print(sales)
multiSales = sales.copy()


# Go back to the sales as it was prior to indexing in the above step
# Set the index to the column 'state': sales
sales = oldSales.set_index(["state"])

# Print the sales DataFrame
print(sales)

# Access the data from 'NY'
print(sales.loc["NY"])


# Go back to sales as the Multi-Index dataset again . . . 
sales = multiSales.copy()
sales = sales.sort_index(ascending=True)  # Could not grab without error unless ascending=True

# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[ ("NY", 1) ]

# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[ (["CA", "TX"], 2) , :]

# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[ (slice(None), 2), :]
##      eggs  salt  spam
## JAN    47  12.0    17
## FEB   110  50.0    31
## MAR   221  89.0    72
## APR    77  87.0    20
## MAY   132   NaN    52
## JUN   205  60.0    55
##         eggs  salt  spam
## MONTHS                  
## JAN       47  12.0    17
## FEB      110  50.0    31
## MAR      221  89.0    72
## APR       77  87.0    20
## MAY      132   NaN    52
## JUN      205  60.0    55
## PRODUCTS  eggs  salt  spam
## MONTHS                    
## JAN         47  12.0    17
## FEB        110  50.0    31
## MAR        221  89.0    72
## APR         77  87.0    20
## MAY        132   NaN    52
## JUN        205  60.0    55
## PRODUCTS  eggs  salt  spam
## Jan         47  12.0    17
## Feb        110  50.0    31
## Mar        221  89.0    72
## Apr         77  87.0    20
## May        132   NaN    52
## Jun        205  60.0    55
## PRODUCTS  eggs  salt  spam
## CA 1        47  12.0    17
##    2       110  50.0    31
## TX 1       132   NaN    52
##    2       205  60.0    55
## PRODUCTS  eggs  salt  spam
## CA 1        47  12.0    17
##    2       110  50.0    31
## NY 1       221  89.0    72
##    2        77  87.0    20
## TX 1       132   NaN    52
##    2       205  60.0    55
## PRODUCTS     eggs  salt  spam
## state month                  
## TX    2       205  60.0    55
##       1       132   NaN    52
## NY    2        77  87.0    20
##       1       221  89.0    72
## CA    2       110  50.0    31
##       1        47  12.0    17
## PRODUCTS  eggs  salt  spam  month
## state                            
## CA          47  12.0    17      1
## CA         110  50.0    31      2
## NY         221  89.0    72      1
## NY          77  87.0    20      2
## TX         132   NaN    52      1
## TX         205  60.0    55      2
## PRODUCTS  eggs  salt  spam  month
## state                            
## NY         221  89.0    72      1
## NY          77  87.0    20      2

Chapter 3 - Rearranging and Reshaping Data

Pivoting DataFrames - changing shapes to one that better suits analysis needs:

  • The .pivot() method allows for specifying an index (row variables), a columns variable, and a values variable
    • myDF.pivot(index=“idxVar”, columns=“colVar”, values=“valVar”) will create a table with idxVar as the rows, colVar as the columns, and valVar as the cell values
    • If values= is omitted, then all other columns are used for values, with a separate set of columns made for each of those values variables

Stacking and unstacking DataFrames - the idea of moving variables to/from the index so that the columns match data needs:

  • myDF.unstack(level=“myVar”) will move myVar out of the index and instead place it as a hierarchical component of the column variables
    • Can instead use an index number for the level=
  • myDF.stack(level=“myVar”) moves a hierarchical component of the column variables into the index instead
  • myDF.swaplevel(0, 1) will change the hierarchy of the multi-index so that the first-order becomes the second-order and the second-order becomes the first-order
    • myDF.sort_index() might then be needed since the .swaplevel() does not re-order the rows; it just changes who is first/second
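
A compact sketch of the round trip, assuming a hypothetical visitors table indexed by city and weekday:

```python
import pandas as pd

# Hypothetical data with a two-level row index
idx = pd.MultiIndex.from_product(
    [["Austin", "Dallas"], ["Mon", "Sun"]], names=["city", "weekday"]
)
users = pd.DataFrame({"visitors": [326, 139, 456, 237]}, index=idx)

wide = users.unstack(level="weekday")   # weekday moves from the rows to the columns
tall = wide.stack(level="weekday")      # and back into the row index

swapped = users.swaplevel(0, 1)         # weekday becomes the outer level; rows keep their order
resorted = swapped.sort_index()         # now ordered by weekday, then city

print(wide)
print(resorted)
```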

Melting DataFrames - converting pivoted data back into a column format:

  • pd.melt(myDF, id_vars=) will convert everything other than the id_vars back to a column called “variable” and a column called “value”
    • Can also use var_name= and value_name= with more descriptive strings to avoid the names “variable” and “value”
  • There is also the option to use value_vars= to specify the columns to un-pivot (default is everything not listed in id_vars)
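
Melting can be viewed as the inverse of pivoting. A minimal sketch, assuming a hypothetical wide table with one column per city:

```python
import pandas as pd

# Hypothetical wide-format data: one row per weekday, one column per city
wide = pd.DataFrame(
    {"weekday": ["Mon", "Sun"], "Austin": [326, 139], "Dallas": [456, 237]}
)

# Descriptive names replace the default "variable" and "value"
long = pd.melt(wide, id_vars=["weekday"], var_name="city", value_name="visitors")

# value_vars restricts which columns are un-pivoted; default names are used here
only_austin = pd.melt(wide, id_vars=["weekday"], value_vars=["Austin"])

# Pivoting the long table recovers the wide layout
back = long.pivot(index="weekday", columns="city", values="visitors")
print(long)
print(back)
```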

Pivot tables are needed when there are multiple rows with the same index (if pivoted) - need to specify how to manage the duplicates:

  • myDF.pivot_table(index=, columns=, values=, aggfunc=)
    • The default is that aggfunc=“mean” but can specify “sum” or “count” or the like instead
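
The difference from .pivot() shows up as soon as an index/column pair is repeated. A minimal sketch, assuming hypothetical data in which Sun/Austin appears twice:

```python
import pandas as pd

# Hypothetical data: the (Sun, Austin) combination occurs twice
visits = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Sun", "Mon"],
    "city": ["Austin", "Austin", "Dallas", "Austin"],
    "visitors": [100, 50, 237, 326],
})

# .pivot() cannot decide which duplicate to keep and raises a ValueError
try:
    visits.pivot(index="weekday", columns="city", values="visitors")
    pivot_ok = True
except ValueError:
    pivot_ok = False

# .pivot_table() aggregates the duplicates (mean by default)
mean_table = visits.pivot_table(index="weekday", columns="city", values="visitors")
sum_table = visits.pivot_table(
    index="weekday", columns="city", values="visitors", aggfunc="sum"
)
print(pivot_ok)
print(mean_table)
print(sum_table)
```

Combinations with no data (Mon/Dallas here) come back as NaN.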

Example code includes:


myPath = "./PythonInputFiles/"


import pandas as pd

users=pd.DataFrame()
users["weekday"] = ["Sun", "Sun", "Mon", "Mon"]
users["city"] = ["Austin", "Dallas", "Austin", "Dallas"]
users["visitors"] = [139, 237, 326, 456]
users["signups"] = [7, 12, 3, 5]


# Pivot the users DataFrame: visitors_pivot
visitors_pivot = users.pivot(index="weekday", columns="city", values="visitors")

# Print the pivoted DataFrame
print(visitors_pivot)


# Pivot users with signups indexed by weekday and city: signups_pivot
signups_pivot = users.pivot(index="weekday", columns="city", values="signups")

# Print signups_pivot
print(signups_pivot)


# Pivot users by both signups and visitors: pivot
pivot = users.pivot(index="weekday", columns="city")

# Print the pivoted DataFrame
print(pivot)


a = users.set_index(["city", "weekday"])
users = a.sort_index()


# Unstack users by 'weekday': byweekday
byweekday = users.unstack(level="weekday")

# Print the byweekday DataFrame
print(byweekday)

# Stack byweekday by 'weekday' and print it
print(byweekday.stack(level="weekday"))


# Unstack users by 'city': bycity
bycity = users.unstack(level="city")

# Print the bycity DataFrame
print(bycity)

# Stack bycity by 'city' and print it
print(bycity.stack(level="city"))


# Stack 'city' back into the index of bycity: newusers
newusers = bycity.stack(level="city")

# Swap the levels of the index of newusers: newusers
newusers = newusers.swaplevel(0, 1)

# Print newusers and verify that the index is not sorted
print(newusers)

# Sort the index of newusers: newusers
newusers = newusers.sort_index()

# Print newusers and verify that the index is now sorted
print(newusers)

# Verify that the new DataFrame is equal to the original
print(newusers.equals(users))


visitors_by_city_weekday = users[["visitors"]].unstack(level="city").reset_index()
visitors_by_city_weekday.columns = ["weekday", "Austin", "Dallas"]


# Reset the index: visitors_by_city_weekday
# visitors_by_city_weekday = visitors_by_city_weekday.reset_index()  # this needed to be done above to get the column names right . . . 

# Print visitors_by_city_weekday
print(visitors_by_city_weekday)

# Melt visitors_by_city_weekday: visitors
visitors = pd.melt(visitors_by_city_weekday, id_vars=["weekday"], value_name="visitors", var_name="city")

# Print visitors
print(visitors)


users=pd.DataFrame()
users["weekday"] = ["Sun", "Sun", "Mon", "Mon"]
users["city"] = ["Austin", "Dallas", "Austin", "Dallas"]
users["visitors"] = [139, 237, 326, 456]
users["signups"] = [7, 12, 3, 5]

# Melt users: skinny
skinny = pd.melt(users, id_vars = ["weekday", "city"], value_vars=["visitors", "signups"])

# Print skinny
print(skinny)


# Set the new index: users_idx
users_idx = users.set_index(['city', 'weekday'])

# Print the users_idx DataFrame
print(users_idx)

# Obtain the key-value pairs: kv_pairs
kv_pairs = pd.melt(users_idx, col_level=0)

# Print the key-value pairs
print(kv_pairs)


# Create the DataFrame with the appropriate pivot table: by_city_day
by_city_day = users.pivot_table(index="weekday", columns="city")

# Print by_city_day
print(by_city_day)


# Use a pivot table to display the count of each column: count_by_weekday1
count_by_weekday1 = users.pivot_table(index="weekday", aggfunc="count")

# Print count_by_weekday1
print(count_by_weekday1)


# Replace 'aggfunc='count'' with 'aggfunc=len': count_by_weekday2
count_by_weekday2 = users.pivot_table(index="weekday", aggfunc=len)

# Verify that the same result is obtained
print('==========================================')
print(count_by_weekday1.equals(count_by_weekday2))


# Create the DataFrame with the appropriate pivot table: signups_and_visitors
signups_and_visitors = users.pivot_table(index="weekday", aggfunc=sum)

# Print signups_and_visitors
print(signups_and_visitors)

# Add in the margins: signups_and_visitors_total 
signups_and_visitors_total = users.pivot_table(index="weekday", aggfunc=sum, margins=True)

# Print signups_and_visitors_total
print(signups_and_visitors_total)
## city     Austin  Dallas
## weekday                
## Mon         326     456
## Sun         139     237
## city     Austin  Dallas
## weekday                
## Mon           3       5
## Sun           7      12
##         visitors        signups       
## city      Austin Dallas  Austin Dallas
## weekday                               
## Mon          326    456       3      5
## Sun          139    237       7     12
##         visitors      signups    
## weekday      Mon  Sun     Mon Sun
## city                             
## Austin       326  139       3   7
## Dallas       456  237       5  12
##                 visitors  signups
## city   weekday                   
## Austin Mon           326        3
##        Sun           139        7
## Dallas Mon           456        5
##        Sun           237       12
##         visitors        signups       
## city      Austin Dallas  Austin Dallas
## weekday                               
## Mon          326    456       3      5
## Sun          139    237       7     12
##                 visitors  signups
## weekday city                     
## Mon     Austin       326        3
##         Dallas       456        5
## Sun     Austin       139        7
##         Dallas       237       12
##                 visitors  signups
## city   weekday                   
## Austin Mon           326        3
## Dallas Mon           456        5
## Austin Sun           139        7
## Dallas Sun           237       12
##                 visitors  signups
## city   weekday                   
## Austin Mon           326        3
##        Sun           139        7
## Dallas Mon           456        5
##        Sun           237       12
## True
##   weekday  Austin  Dallas
## 0     Mon     326     456
## 1     Sun     139     237
##   weekday    city  visitors
## 0     Mon  Austin       326
## 1     Sun  Austin       139
## 2     Mon  Dallas       456
## 3     Sun  Dallas       237
##   weekday    city  variable  value
## 0     Sun  Austin  visitors    139
## 1     Sun  Dallas  visitors    237
## 2     Mon  Austin  visitors    326
## 3     Mon  Dallas  visitors    456
## 4     Sun  Austin   signups      7
## 5     Sun  Dallas   signups     12
## 6     Mon  Austin   signups      3
## 7     Mon  Dallas   signups      5
##                 visitors  signups
## city   weekday                   
## Austin Sun           139        7
## Dallas Sun           237       12
## Austin Mon           326        3
## Dallas Mon           456        5
##    variable  value
## 0  visitors    139
## 1  visitors    237
## 2  visitors    326
## 3  visitors    456
## 4   signups      7
## 5   signups     12
## 6   signups      3
## 7   signups      5
##         signups        visitors       
## city     Austin Dallas   Austin Dallas
## weekday                               
## Mon           3      5      326    456
## Sun           7     12      139    237
##          city  signups  visitors
## weekday                         
## Mon         2        2         2
## Sun         2        2         2
## ==========================================
## True
##          signups  visitors
## weekday                   
## Mon            8       782
## Sun           19       376
##          signups  visitors
## weekday                   
## Mon          8.0     782.0
## Sun         19.0     376.0
## All         27.0    1158.0

Chapter 4 - Grouping data

Categoricals and groupby - using the .groupby() method and then chaining various commands to it:

  • myDF.groupby(“myGroupVar”).count() will provide a count summarized by myGroupVar (the count of non-missing values is reported separately for each column)
  • In essence, this is running the split-apply-combine methodology, where the .groupby() is the split, the .count() is the apply, and the combine is the result by default
  • Can act on a subset of columns using myDF.groupby(“myGroupVar”)[[“myColA”, “myColB”]].sum() to get sums of myColA/myColB by myGroupVar
    • Can also have a multi-level .groupby() such as myDF.groupby([“myGroupA”, “myGroupB”]).mean()
    • Can also use a .groupby(myVar) provided that myVar has been created to have the same index as the pandas DataFrame
  • With categorical data, use .unique() to get the unique values
    • Create categorical variables using .astype(“category”)
  • Categorical variables use less memory and speed up group-by processing
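
A minimal sketch of these groupby patterns. The sales table and the customers Series are hypothetical data made up for illustration:

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "bread": [139, 237, 326, 456],
    "butter": [20, 45, 70, 98],
})

counts = sales.groupby("weekday").count()                     # count per column
sums = sales.groupby("weekday")[["bread", "butter"]].sum()    # subset of columns
means = sales.groupby(["city", "weekday"]).mean()             # multi-level grouping

# Grouping by a separate Series that shares the DataFrame's index
customers = pd.Series(["Dave", "Alice", "Bob", "Alice"])
by_customer = sales.groupby(customers)["bread"].sum()

# Unique values and conversion to a categorical
unique_days = sales["weekday"].unique()
day_cat = sales["weekday"].astype("category")

print(sums)
print(by_customer)
```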

Groupby and aggregation - running multiple calculations after the split and before the combine:

  • Can use .agg([“max”, “sum”]) to run both max() and sum() on the data (will get both values back in the results)
    • Can pass a list of quoted strings that reflect built-in functions
    • Can pass an unquoted function name that is a custom user-defined function
    • Can pass in a dictionary where the keys are the variables and the values are the functions to be run on those variables
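
The three ways of calling .agg() can be sketched as follows. The sales data and the data_range helper are hypothetical, introduced only for this example:

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "bread": [139, 237, 326, 456],
    "butter": [20, 45, 70, 98],
})

def data_range(series):
    """Custom aggregation: spread between the largest and smallest value."""
    return series.max() - series.min()

# A list of built-in names gives one result column per (variable, function) pair
both = sales.groupby("city")[["bread", "butter"]].agg(["max", "sum"])

# A user-defined function is passed unquoted
custom = sales.groupby("weekday")[["bread", "butter"]].agg(data_range)

# A dictionary runs a different function on each variable
per_col = sales.groupby("city").agg({"bread": "sum", "butter": data_range})

print(both)
print(per_col)
```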

Groupby and transformation - applying different transformations to different groups:

  • myDF.groupby(“myGroupVar”).transform(myFunc) will apply myFunc separately to each group of myGroupVar, returning the same index/order as myDF
  • The .transform() returns one value per input row (the result keeps the shape of the original data) rather than one value per group
  • Can also use myDF.groupby(“myGroupVar”).apply(myFunc) if the myFunc is too complicated to be implemented by way of .transform()
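
The difference between .transform() and .apply() can be sketched as follows. The data are made up, and group_range is a hypothetical helper used only to show a function that returns one value per group.

```python
import pandas as pd

# Hypothetical toy data with missing values
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "value": [1.0, None, 3.0, 10.0, None],
})

def impute_median(series):
    """Fill missing values with the median of the same group."""
    return series.fillna(series.median())

# transform: computed per group, returned with the original index and order
filled = df.groupby("group")["value"].transform(impute_median)

def group_range(frame):
    """Hypothetical helper: receives each group's sub-DataFrame, returns one number."""
    return frame["value"].max() - frame["value"].min()

# apply: the function sees each group's sub-DataFrame and may return any shape
ranges = df.groupby("group")[["value"]].apply(group_range)

print(filled)
print(ranges)
```

Since filled has the same index as df, it can be assigned straight back as a column.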

Groupby and filtering - filtering groups prior to aggregating:

  • The .groupby() is essentially creating a dictionary with keys being the groups and values being the associated data within that group
    • So, if splitting = myDF.groupby(“myGroupVar”) then for groupName, groupData in splitting: is a valid syntax
    • This opens up the ability to filter within a for loop, so that the results provided are just for the desired filtering criteria
    • Can also use a dictionary comprehension {} to get these back as a dictionary, followed by pd.Series() to print the dictionary with keys as indices
  • Can also use booleans as part of the groupby() if the goal is to get (for example) averages by whether something is in/out of a certain key class
    • myDF.groupby([“myGroupVar”, myBoolSeries]).mean() will provide the mean grouped by myGroupVar and myBoolSeries
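
A small sketch of iterating over a groupby object, filtering whole groups, and grouping by a derived label Series. The company names, unit counts, and the "large"/"small" labels are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data; values are for illustration only
df = pd.DataFrame({
    "company": ["Acme", "Acme", "Birch", "Birch", "Cedar"],
    "units": [10, 30, 5, 7, 40],
})

splitting = df.groupby("company")

# Each iteration yields the group key and the rows belonging to that group
totals = pd.Series({name: data["units"].sum() for name, data in splitting})

# .filter() keeps every row of the groups for which the function returns True
big = splitting.filter(lambda g: g["units"].sum() > 35)

# Group by a column name together with a label Series derived from a Boolean test
size = (df["units"] >= 10).map({True: "large", False: "small"}).rename("size")
mean_by_size = df.groupby(["company", size])["units"].mean()

print(totals)
print(big)
print(mean_by_size)
```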

Example code includes:


myPath = "./PythonInputFiles/"


# Need to bring in "titanic" (the course version is 1309 x 14; the local file read here has 891 rows and 12 columns)
import pandas as pd
titanic = pd.read_csv(myPath + 'titanic.csv', index_col=0)

titanic.columns = ['id', 'survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked']

# titanic.columns = ['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']

# Group titanic by 'pclass'
by_class = titanic.groupby("pclass")

# Aggregate 'survived' column of by_class by count
count_by_class = by_class["survived"].count()

# Print count_by_class
print(count_by_class)

# Group titanic by 'embarked' and 'pclass'
by_mult = titanic.groupby(["embarked", "pclass"])

# Aggregate 'survived' column of by_mult by count
count_mult = by_mult["survived"].count()

# Print count_mult
print(count_mult)


# Saved to myPath as lifeSaved.csv and regionsSaved.csv
# life_f = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1650/datasets/life_expectancy.csv'
# regions_f = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1650/datasets/regions.csv'

life = pd.read_csv(myPath + "lifeSaved.csv", index_col='Country', encoding="latin-1")
regions = pd.read_csv(myPath + "regionsSaved.csv", index_col='Country', encoding="latin-1")

# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions["region"])

# Print the mean over the '2010' column of life_by_region
print(life_by_region["2010"].mean())


# Again using the titanic dataset (same as above)

# Group titanic by 'pclass': by_class
by_class = titanic.groupby("pclass")

# Select 'age' and 'fare'
by_class_sub = by_class[['age','fare']]

# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(["max", "median"])

# Print the maximum age in each class
print(aggregated.loc[:, ('age','max')])

# Print the median fare in each class
print(aggregated.loc[:, ('fare', 'median')])


# Read the CSV file into a DataFrame and sort the index: gapminder
# NEED FILE!
# gapminder = pd.read_csv("gapminder.csv", index_col=['Year','region','Country']).sort_index()

# Group gapminder by 'Year' and 'region': by_year_region
# by_year_region = gapminder.groupby(level=["Year", "region"])

# Define the function to compute spread: spread
# def spread(series):
#     return series.max() - series.min()

# Create the dictionary: aggregator
# aggregator = {'population':'sum', 'child_mortality':'mean', 'gdp':spread}

# Aggregate by_year_region using the dictionary: aggregated
# aggregated = by_year_region.agg(aggregator)

# Print the last 6 entries of aggregated 
# print(aggregated.tail(6))


# NEED FILE
# Read file: sales
# sales = pd.read_csv("sales.csv", index_col="Date", parse_dates=True)

# Create a groupby object: by_day
# by_day = sales.groupby(sales.index.strftime('%a'))

# Create sum: units_sum
# units_sum = by_day.sum()

# Print units_sum
# print(units_sum)


# Import zscore
# from scipy.stats import zscore

# Group gapminder_2010: standardized
# standardized = gapminder_2010.groupby("region")[['life','fertility']].transform(zscore)

# Construct a Boolean Series to identify outliers: outliers
# outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)

# Filter gapminder_2010 by the outliers: gm_outliers
# gm_outliers = gapminder_2010.loc[outliers]

# Print gm_outliers
# print(gm_outliers)


# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(["sex", "pclass"])

# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())

# Impute age and assign to titanic['age']
titanic.age = by_sex_class["age"].transform(impute_median)

# Print the output of titanic.tail(10)
print(titanic.tail(10))


def disparity(gr):
    # Compute the spread of gr['gdp']: s
    s = gr['gdp'].max() - gr['gdp'].min()
    # Compute the z-score of gr['gdp'] as (gr['gdp']-gr['gdp'].mean())/gr['gdp'].std(): z
    z = (gr['gdp'] - gr['gdp'].mean())/gr['gdp'].std()
    # Return a DataFrame with the inputs {'z(gdp)':z, 'regional spread(gdp)':s}
    return pd.DataFrame({'z(gdp)':z , 'regional spread(gdp)':s})


# NEED FILE!
# Group gapminder_2010 by 'region': regional
# regional = gapminder_2010.groupby("region")

# Apply the disparity function on regional: reg_disp
# reg_disp = regional.apply(disparity)

# Print the disparity of 'United States', 'United Kingdom', and 'China'
# print(reg_disp.loc[['United States','United Kingdom','China'], :])


def c_deck_survival(gr):
    c_passengers = gr['cabin'].str.startswith('C').fillna(False)
    return gr.loc[c_passengers, 'survived'].mean()


# Create a groupby object using titanic over the 'sex' column: by_sex
by_sex = titanic.groupby("sex")

# Call by_sex.apply with the function c_deck_survival and print the result
c_surv_by_sex = by_sex.apply(c_deck_survival)

# Print the survival rates
print(c_surv_by_sex)


# NEED FILE!
# Read the CSV file into a DataFrame: sales
# sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)

# Group sales by 'Company': by_company
# by_company = sales.groupby("Company")

# Compute the sum of the 'Units' of by_company: by_com_sum
# by_com_sum = by_company["Units"].sum()
# print(by_com_sum)

# Filter 'Units' where the sum is > 35: by_com_filt
# by_com_filt = by_company.filter(lambda g:g['Units'].sum() > 35)
# print(by_com_filt)


# Create the Boolean Series: under10
under10 = (titanic['age'] < 10).map({True:'under 10', False:'over 10'})

# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)["survived"].mean()
print(survived_mean_1)

# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, "pclass"])["survived"].mean()
print(survived_mean_2)
## pclass
## 1    216
## 2    184
## 3    491
## Name: survived, dtype: int64
## embarked  pclass
## C         1          85
##           2          17
##           3          66
## Q         1           2
##           2           3
##           3          72
## S         1         127
##           2         164
##           3         353
## Name: survived, dtype: int64
## region
## America                       74.037350
## East Asia & Pacific           73.405750
## Europe & Central Asia         75.656387
## Middle East & North Africa    72.805333
## South Asia                    68.189750
## Sub-Saharan Africa            57.575080
## Name: 2010, dtype: float64
## pclass
## 1    80.0
## 2    70.0
## 3    74.0
## Name: (age, max), dtype: float64
## pclass
## 1    60.2875
## 2    14.2500
## 3     8.0500
## Name: (fare, median), dtype: float64
##       id  survived  pclass                                      name     sex  \
## 882  882         0       3                        Markun, Mr. Johann    male   
## 883  883         0       3              Dahlberg, Miss. Gerda Ulrika  female   
## 884  884         0       2             Banfield, Mr. Frederick James    male   
## 885  885         0       3                    Sutehall, Mr. Henry Jr    male   
## 886  886         0       3      Rice, Mrs. William (Margaret Norton)  female   
## 887  887         0       2                     Montvila, Rev. Juozas    male   
## 888  888         1       1              Graham, Miss. Margaret Edith  female   
## 889  889         0       3  Johnston, Miss. Catherine Helen "Carrie"  female   
## 890  890         1       1                     Behr, Mr. Karl Howell    male   
## 891  891         0       3                       Dooley, Mr. Patrick    male   
## 
##       age  sibsp  parch            ticket     fare cabin embarked  
## 882  33.0      0      0            349257   7.8958   NaN        S  
## 883  22.0      0      0              7552  10.5167   NaN        S  
## 884  28.0      0      0  C.A./SOTON 34068  10.5000   NaN        S  
## 885  25.0      0      0   SOTON/OQ 392076   7.0500   NaN        S  
## 886  39.0      0      5            382652  29.1250   NaN        Q  
## 887  27.0      0      0            211536  13.0000   NaN        S  
## 888  19.0      0      0            112053  30.0000   B42        S  
## 889  21.5      1      2        W./C. 6607  23.4500   NaN        S  
## 890  26.0      0      0            111369  30.0000  C148        C  
## 891  32.0      0      0            370376   7.7500   NaN        Q  
## sex
## female    0.888889
## male      0.343750
## dtype: float64
## age
## over 10     0.366707
## under 10    0.612903
## Name: survived, dtype: float64
## age       pclass
## over 10   1         0.629108
##           2         0.419162
##           3         0.222717
## under 10  1         0.666667
##           2         1.000000
##           3         0.452381
## Name: survived, dtype: float64

Chapter 5 - Case Study (Summer Olympics)

Introduction to the Summer Olympics data and analysis objectives:

  • Olympic medals dataset from 1896 to 2008 - find patterns by countries/medals and the like
  • Indexing, pivoting, pivot_table(), groupby() will all be handy
  • Can use unique() and value_counts() to better understand categorical data and available levels

Understanding the column labels - looking at the Gender and Event_gender columns to understand how they are different:

  • Categorical data handling tools such as .value_counts()
  • Boolean processing to assess where values are true or false

Constructing alternative country rankings:

  • Top 5 countries that have won medals in the most sports
  • Medal counts of USA vs USSR for 1952-1988
  • There are two valuable DataFrame methods for finding maxima and minima
    • .idxmax() returns the label where the maximum value is located (much like which.max in R)
    • .idxmin() returns the label where the minimum value is located (much like which.min in R)
    • Including axis=“columns” will run the search along the columns rather than the rows
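
A minimal sketch of .idxmax() and .idxmin() on a small table. The team names and counts are invented for illustration and are not Olympic results.

```python
import pandas as pd

# Hypothetical counts: rows are years, columns are teams
counts = pd.DataFrame(
    {"team_a": [10, 4, 7], "team_b": [3, 9, 8]},
    index=pd.Index([2000, 2004, 2008], name="year"),
)

# Default: search down each column, return the row label of the max/min
best_year = counts.idxmax()
worst_year = counts.idxmin()

# axis="columns": search across each row, return the column label of the max
winner_by_year = counts.idxmax(axis="columns")

print(best_year)
print(worst_year)
print(winner_by_year.value_counts())
```

Chaining .value_counts() onto the row-wise result tallies how often each column "won", which is the pattern used for the USA vs URS comparison in the example code.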

Reshaping DataFrames for visualization:

  • With plots, the labels come from the index by default
  • Generally, the matplotlib operations work best when there is a single-level index
    • The .unstack() is a form of re-shaping that can help to achieve this
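
The reshaping step can be sketched as below, assuming a tiny made-up medals table (the athlete labels are placeholders). Plotting is left out so the sketch runs without matplotlib.

```python
import pandas as pd

# Hypothetical toy data; one row per medal awarded
medals = pd.DataFrame({
    "Edition": [2000, 2000, 2000, 2004, 2004],
    "Medal": ["Gold", "Gold", "Silver", "Gold", "Bronze"],
    "Athlete": ["a", "b", "c", "d", "e"],
})

# Counting by two keys gives a Series with a two-level (Edition, Medal) index
by_year = medals.groupby(["Edition", "Medal"])["Athlete"].count()

# Unstacking moves the 'Medal' level into the columns, leaving a single-level index
wide = by_year.unstack(level="Medal")

print(by_year)
print(wide)
# wide.plot() would then draw one line per medal type, with Edition on the x-axis
```

Combinations that never occur (here, a 2004 silver) show up as NaN after unstacking.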

Example code includes:


myPath = "./PythonInputFiles/"


import pandas as pd
import matplotlib.pyplot as plt


# Data is from https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data
# medals is 29216x10 with ['City', 'Edition', 'Sport', 'Discipline', 'Athlete', 'NOC', 'Gender', 'Event', 'Event_gender', 'Medal']
# Downloaded file from Guardian as myPath + "summerOlympics_Medalists_1896_2008.csv" - read file in
medals = pd.read_csv(myPath + "summerOlympics_Medalists_1896_2008.csv", header=4)



USA_edition_grouped = medals.loc[medals.NOC == 'USA'].groupby('Edition')

# Select the 'NOC' column of medals: country_names
country_names = medals["NOC"]

# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()

# Print top 15 countries ranked by medals
print(medal_counts.head(15))


# Construct the pivot table: counted
counted = medals.pivot_table(index="NOC", columns="Medal", values="Athlete", aggfunc="count")

# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis="columns")

# Sort counted by the 'totals' column
counted = counted.sort_values("totals", ascending=False)

# Print the top 15 rows of counted
print(counted.head(15))


# Select columns: ev_gen
ev_gen = medals[["Event_gender", "Gender"]]

# Drop duplicate pairs: ev_gen_uniques
ev_gen_uniques = ev_gen.drop_duplicates()

# Print ev_gen_uniques
print(ev_gen_uniques)


# Group medals by the two columns: medals_by_gender
medals_by_gender = medals.groupby(['Event_gender', 'Gender'])

# Create a DataFrame with a group count: medal_count_by_gender
medal_count_by_gender = medals_by_gender.count()

# Print medal_count_by_gender
print(medal_count_by_gender)


# Create the Boolean Series: sus
sus = (medals.Event_gender == 'W') & (medals.Gender == 'Men')

# Create a DataFrame with the suspicious row: suspect
suspect = medals.loc[sus, :]

# Print suspect
print(suspect)


# Group medals by 'NOC': country_grouped
country_grouped = medals.groupby("NOC")

# Compute the number of distinct sports in which each country won medals: Nsports
Nsports = country_grouped["Sport"].nunique()

# Sort the values of Nsports in descending order
Nsports = Nsports.sort_values(ascending=False)

# Print the top 15 rows of Nsports
print(Nsports.head(15))


# Extract all rows for which the 'Edition' is between 1952 & 1988: during_cold_war
during_cold_war = (medals["Edition"] >= 1952) & (medals["Edition"] <= 1988)

# Extract rows for which 'NOC' is either 'USA' or 'URS': is_usa_urs
is_usa_urs = medals.NOC.isin(["USA", "URS"])

# Use during_cold_war and is_usa_urs to create the DataFrame: cold_war_medals
cold_war_medals = medals.loc[during_cold_war & is_usa_urs]

# Group cold_war_medals by 'NOC'
country_grouped = cold_war_medals.groupby("NOC")

# Create Nsports
Nsports = country_grouped["Sport"].nunique().sort_values(ascending=False)

# Print Nsports
print(Nsports)


# Create the pivot table: medals_won_by_country
medals_won_by_country = medals.pivot_table(index="Edition", columns="NOC", values="Athlete", aggfunc="count")

# Slice medals_won_by_country: cold_war_usa_usr_medals
cold_war_usa_usr_medals = medals_won_by_country.loc[1952:1988, ["USA", "URS"]]

# Create most_medals 
most_medals = cold_war_usa_usr_medals.idxmax(axis="columns")

# Print most_medals.value_counts()
print(most_medals.value_counts())


# Create the DataFrame: usa
usa = medals.loc[medals["NOC"] == "USA"]

# Group usa by ['Edition', 'Medal'] and aggregate over 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])["Athlete"].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level="Medal")

# Plot the DataFrame usa_medals_by_year
usa_medals_by_year.plot()
# plt.show()
plt.savefig("_dummyPy070.png", bbox_inches="tight")
plt.clf()


# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by ['Edition', 'Medal'] and count the 'Athlete' entries
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
# plt.show()
plt.savefig("_dummyPy071.png", bbox_inches="tight")
plt.clf()


# Redefine 'Medal' as an ordered categorical
medals.Medal = pd.Categorical(values=medals.Medal, categories=['Bronze', 'Silver', 'Gold'], ordered=True)

# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by ['Edition', 'Medal'] and count the 'Athlete' entries
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
# plt.show()
plt.savefig("_dummyPy072.png", bbox_inches="tight")
plt.clf()
## USA    4335
## URS    2049
## GBR    1594
## FRA    1314
## ITA    1228
## GER    1211
## AUS    1075
## HUN    1053
## SWE    1021
## GDR     825
## NED     782
## JPN     704
## CHN     679
## RUS     638
## ROU     624
## Name: NOC, dtype: int64
## Medal  Bronze    Gold  Silver  totals
## NOC                                  
## USA    1052.0  2088.0  1195.0  4335.0
## URS     584.0   838.0   627.0  2049.0
## GBR     505.0   498.0   591.0  1594.0
## FRA     475.0   378.0   461.0  1314.0
## ITA     374.0   460.0   394.0  1228.0
## GER     454.0   407.0   350.0  1211.0
## AUS     413.0   293.0   369.0  1075.0
## HUN     345.0   400.0   308.0  1053.0
## SWE     325.0   347.0   349.0  1021.0
## GDR     225.0   329.0   271.0   825.0
## NED     320.0   212.0   250.0   782.0
## JPN     270.0   206.0   228.0   704.0
## CHN     193.0   234.0   252.0   679.0
## RUS     240.0   192.0   206.0   638.0
## ROU     282.0   155.0   187.0   624.0
##       Event_gender Gender
## 0                M    Men
## 348              X    Men
## 416              W  Women
## 639              X  Women
## 23675            W    Men
##                       City  Edition  Sport  Discipline  Athlete    NOC  Event  \
## Event_gender Gender                                                             
## M            Men     20067    20067  20067       20067    20067  20067  20067   
## W            Men         1        1      1           1        1      1      1   
##              Women    7277     7277   7277        7277     7277   7277   7277   
## X            Men      1653     1653   1653        1653     1653   1653   1653   
##              Women     218      218    218         218      218    218    218   
## 
##                      Medal  
## Event_gender Gender         
## M            Men     20067  
## W            Men         1  
##              Women    7277  
## X            Men      1653  
##              Women     218  
##          City  Edition      Sport Discipline            Athlete  NOC Gender  \
## 23675  Sydney     2000  Athletics  Athletics  CHEPCHUMBA, Joyce  KEN    Men   
## 
##           Event Event_gender   Medal  
## 23675  marathon            W  Bronze  
## NOC
## USA    34
## GBR    31
## FRA    28
## GER    26
## CHN    24
## AUS    22
## ESP    22
## CAN    22
## SWE    21
## URS    21
## ITA    21
## NED    20
## RUS    20
## JPN    20
## DEN    19
## Name: Sport, dtype: int64
## NOC
## URS    21
## USA    20
## Name: Sport, dtype: int64
## URS    8
## USA    2
## dtype: int64

Summer Olympics - USA Medals:

Merging DataFrames with pandas

Chapter 1 - Preparing data

Reading multiple data files - many tools such as pd.read_csv(), pd.read_excel(), pd.read_html(), pd.read_json():

  • Typically, loading multiple files leads to creating multiple pandas DataFrames
  • A typical way to vectorize file reading is with lists and a for loop - dataframes = [] ; for files in myFileList: dataframes.append(pd.read_csv(files))
    • Alternatively, dataframes = [pd.read_csv(files) for files in myFileList] to use a list comprehension rather than the for loop
  • The glob library can also be helpful to find things like glob(“sales*.csv”) - needs to be preceded with from glob import glob
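
A self-contained sketch of the glob-plus-list-comprehension pattern. The file names and contents are hypothetical and are written to a temporary directory so the example does not depend on any existing files.

```python
import os
import tempfile
from glob import glob

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Create two small hypothetical CSV files to read back
    for month, units in [("jan", 10), ("feb", 20)]:
        path = os.path.join(tmp, f"sales-{month}.csv")
        pd.DataFrame({"units": [units]}).to_csv(path, index=False)

    # glob returns matching paths in arbitrary order, so sort for a stable result
    filenames = sorted(glob(os.path.join(tmp, "sales*.csv")))

    # One DataFrame per file, built with a list comprehension
    dataframes = [pd.read_csv(f) for f in filenames]

print([os.path.basename(f) for f in filenames])
print(dataframes[0])
```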

Reindexing DataFrames - essential for combining DataFrames, since indices are the means by which DataFrames are combined:

  • Can set the indices during pd.read_csv() using the index_col= option
  • Can access the indices using myDF.index
  • Indices can be reordered using a desired list; for example myDF.reindex(myOrderList) will re-index (not performed in place)
    • If the myOrderList contains items that are not in the index for myDF, rows will be created with all values as np.nan
    • If the myOrderList omits items that are in the index for myDF, then those items will be omitted
  • Can also do a straight sort of the index by using myDF.sort_index(), which returns the rows ordered by index label (add ascending=False for reverse order)
  • Use of myDF.dropna() will remove entire rows that contain np.nan
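
The reindexing behaviour described above can be sketched on a three-row DataFrame. The column name and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data indexed by month abbreviation
df = pd.DataFrame({"temp": [61.0, 32.0, 68.0]}, index=["Apr", "Jan", "Jul"])

# 'Feb' is not in df (row of NaN is created); 'Jul' is not in the list (row is dropped)
order = ["Jan", "Feb", "Apr"]
reindexed = df.reindex(order)

# Remove rows that contain NaN
cleaned = reindexed.dropna()

# Sort by index label; like reindex, this returns a new object
sorted_df = df.sort_index()

print(reindexed)
print(cleaned)
print(sorted_df)
```

None of these calls modifies df itself; each returns a new DataFrame.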

Arithmetic with Series and DataFrames - generally, scalar operations can be broadcast in Python:

  • Often need to use the .divide() method to run sensible division of a DataFrame by a Series
    • myDF.divide(mySeries, axis=“rows”) will divide each column of myDF by mySeries
    • More or less, axis=“rows” asks that mySeries be matched to myDF on the row index and then broadcast across the columns, so that “a” becomes “a” “a” “a” to match up to the shape of myDF
  • Percentage change (current row vs previous row) can be accessed using myDF.pct_change()
  • When pandas Series are added together, the resulting index will be the union of the respective Series indices
    • However, anything that is not in the index of ALL the underlying Series will come back as NaN
  • mySeriesA + mySeriesB will give the same result as mySeriesA.add(mySeriesB)
    • Can add fill_value=0 so that a label missing from one Series is treated as 0 rather than producing NaN (the .add() is more flexible than the plus sign)
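
A minimal sketch of the arithmetic rules above, using invented numbers and labels.

```python
import pandas as pd

# Hypothetical toy data; values are for illustration only
df = pd.DataFrame(
    {"open": [10.0, 20.0, 40.0], "close": [12.0, 30.0, 20.0]},
    index=["d1", "d2", "d3"],
)
rate = pd.Series([2.0, 4.0, 5.0], index=["d1", "d2", "d3"])

# Align rate on the row index and divide every column by it
scaled = df.divide(rate, axis="rows")

# Change of each row relative to the previous row (first row is NaN)
change = df["open"].pct_change()

# Adding Series aligns on the union of the indices
a = pd.Series([1.0, 2.0], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["y", "z"])
plus = a + b                       # labels found in only one Series become NaN
filled = a.add(b, fill_value=0)    # a missing label is treated as 0 instead

print(scaled)
print(change)
print(plus)
print(filled)
```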

Example code includes:


myPath = "./PythonInputFiles/"


# Import pandas
import pandas as pd

medals = pd.read_csv(myPath + "summerOlympics_Medalists_1896_2008.csv", header=4)



# Read 'Bronze.csv' into a DataFrame: bronze
# bronze = pd.read_csv("Bronze.csv")
bronze = medals.loc[medals["Medal"] == "Bronze"]

# Read 'Silver.csv' into a DataFrame: silver
# silver = pd.read_csv("Silver.csv")
silver = medals.loc[medals["Medal"] == "Silver"]

# Read 'Gold.csv' into a DataFrame: gold
# gold = pd.read_csv("Gold.csv")
gold = medals.loc[medals["Medal"] == "Gold"]


# Print the first five rows of gold
print(gold.head())


bronze.to_csv(myPath + "olymBronze.csv", index=False)
silver.to_csv(myPath + "olymSilver.csv", index=False)
gold.to_csv(myPath + "olymGold.csv", index=False)


# One time only - for use in next section
# bronze[["NOC", "Athlete"]].groupby("NOC").count().sort_values("Athlete", ascending=False).iloc[0:5, :].to_csv(myPath + "bronze_top5.csv")
# silver[["NOC", "Athlete"]].groupby("NOC").count().sort_values("Athlete", ascending=False).iloc[0:5, :].to_csv(myPath + "silver_top5.csv")
# gold[["NOC", "Athlete"]].groupby("NOC").count().sort_values("Athlete", ascending=False).iloc[0:5, :].to_csv(myPath + "gold_top5.csv")


# Create the list of file names: filenames
filenames = ['olymGold.csv', 'olymSilver.csv', 'olymBronze.csv']

# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(myPath + filename, encoding="latin-1"))

# Print top 5 rows of 1st DataFrame in dataframes
print(dataframes[0].head())


uqNOC = set(list(gold["NOC"].unique()) + list(silver["NOC"].unique()) + list(bronze["NOC"].unique()))

totGold = gold["NOC"].value_counts()
totSilver = silver["NOC"].value_counts()
totBronze = bronze["NOC"].value_counts()

totDF = pd.DataFrame( {"Gold":totGold, "Silver":totSilver, "Bronze":totBronze} ).fillna(0)
totDF["Total"] = totDF["Gold"] + totDF["Silver"] + totDF["Bronze"]
totDF = totDF[["Total", "Gold", "Silver", "Bronze"]]
totDF = totDF.sort_values("Total", ascending=False)
print(totDF.head(20))


# The sole variable is called "Max TemperatureF" with the index being called "Month"
maxTemps = [68, 60, 68, 84, 88, 89, 91, 86, 90, 84, 72, 68]
maxIndex = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


# Read 'monthly_max_temp.csv' into a DataFrame: weather1
# weather1 = pd.read_csv('monthly_max_temp.csv', index_col="Month")

weather1 = pd.DataFrame( {"Max TemperatureF":maxTemps}, index=maxIndex )

# Print the head of weather1
print(weather1.head())

# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()

# Print the head of weather2
print(weather2.head())

# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)

# Print the head of weather3
print(weather3.head())

# Sort weather1 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather1.sort_values("Max TemperatureF")

# Print the head of weather4
print(weather4.head())


# The variable is called "Mean TemperatureF" and the indexing is run by "Month"
# The dataset is then called weather1
meanTemps = [61.956043956043956, 32.133333333333333, 68.934782608695656, 43.434782608695649]
meanIndex = ["Apr", "Jan", "Jul", "Oct"]
year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


weather1 = pd.DataFrame( {"Mean TemperatureF":meanTemps}, index=meanIndex )
print(weather1.head())


# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)

# Print weather2
print(weather2)

# Reindex weather1 using the list year with forward-fill: weather3
weather3 = weather1.reindex(year).ffill()

# Print weather3
print(weather3)


# Baby names data is from https://www.data.gov/developers/baby-names-dataset/

yob1881 = pd.read_csv(myPath + "yob1881.txt", header=None)
yob1981 = pd.read_csv(myPath + "yob1981.txt", header=None)

yob1881.columns = ["Name", "Gender", "Count"]
yob1981.columns = ["Name", "Gender", "Count"]

yob1881 = yob1881.set_index("Name").sort_values("Count", ascending=False)
yob1981 = yob1981.set_index("Name").sort_values("Count", ascending=False)

print(yob1881.shape)
print(yob1981.shape)
print(yob1881.head(12))
print(yob1981.head(12))


# Take only the top-200 names in each year
pop1881 = yob1881.iloc[0:200, :]
pop1981 = yob1981.iloc[0:200, :]


# Reindex pop1981 with the index of pop1881: common_names
common_names = pop1981.reindex(pop1881.index)

# Print shape of common_names
print(common_names.shape)
print(common_names.head(12))

# Drop rows with null counts: common_names
common_names = common_names.dropna()

# Print shape of new common_names
print(common_names.shape)
print(common_names.head(12))


# weather is 365x22 representing 2013 Pittsburgh weather data from Weather Underground
# Used package "weatherData" to grab this from R
# KPIT2013 <- weatherData::getWeatherForDate("KPIT", "2013-01-01", "2013-12-31", opt_all_columns = TRUE)
# write.csv(KPIT2013, "./PythonInputFiles/KPIT2013.csv", row.names=FALSE)

weather = pd.read_csv(myPath + "KPIT2013.csv")

# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min_TemperatureF', 'Mean_TemperatureF', 'Max_TemperatureF']]

# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * (5/9)

# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace("F", "C")

# Print first 5 rows of temps_c
print(temps_c.head())


# Quarterly US GDP data from 1947-01-01 to 2016-04-01
# Downloaded from https://fred.stlouisfed.org/series/GDP as myPath + "US_GDP_1947_2016_StLouisFRED.csv"
# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv(myPath + "US_GDP_1947_2016_StLouisFRED.csv", parse_dates=True, index_col="DATE")

# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc["2008-01-01":, :]

# Print the last 8 rows of post2008
print(post2008.tail(8))

# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample("A").last()

# Print yearly
print(yearly)

# Compute percentage growth of yearly: yearly['growth']
yearly['growth'] = yearly.pct_change()*100

# Print yearly again
print(yearly)


# Import pandas
# import pandas as pd

# Read 'sp500.csv' into a DataFrame: sp500
# sp500 = pd.read_csv("sp500.csv", parse_dates=True, index_col="Date")

# Read 'exchange.csv' into a DataFrame: exchange
# exchange = pd.read_csv("exchange.csv", parse_dates=True, index_col="Date")

# Subset 'Open' & 'Close' columns from sp500: dollars
# dollars = sp500.loc[:, ["Open", "Close"]]

# Print the head of dollars
# print(dollars.head())

# Convert dollars to pounds: pounds
# pounds = dollars.multiply(exchange["GBP/USD"], axis="rows")

# Print the head of pounds
# print(pounds.head())
##       City  Edition      Sport Discipline             Athlete  NOC Gender  \
## 0   Athens     1896   Aquatics   Swimming       HAJOS, Alfred  HUN    Men   
## 3   Athens     1896   Aquatics   Swimming  MALOKINIS, Ioannis  GRE    Men   
## 6   Athens     1896   Aquatics   Swimming       HAJOS, Alfred  HUN    Men   
## 9   Athens     1896   Aquatics   Swimming       NEUMANN, Paul  AUT    Men   
## 13  Athens     1896  Athletics  Athletics       BURKE, Thomas  USA    Men   
## 
##                          Event Event_gender Medal  
## 0               100m freestyle            M  Gold  
## 3   100m freestyle for sailors            M  Gold  
## 6              1200m freestyle            M  Gold  
## 9               400m freestyle            M  Gold  
## 13                        100m            M  Gold  
##      City  Edition      Sport Discipline             Athlete  NOC Gender  \
## 0  Athens     1896   Aquatics   Swimming       HAJOS, Alfred  HUN    Men   
## 1  Athens     1896   Aquatics   Swimming  MALOKINIS, Ioannis  GRE    Men   
## 2  Athens     1896   Aquatics   Swimming       HAJOS, Alfred  HUN    Men   
## 3  Athens     1896   Aquatics   Swimming       NEUMANN, Paul  AUT    Men   
## 4  Athens     1896  Athletics  Athletics       BURKE, Thomas  USA    Men   
## 
##                         Event Event_gender Medal  
## 0              100m freestyle            M  Gold  
## 1  100m freestyle for sailors            M  Gold  
## 2             1200m freestyle            M  Gold  
## 3              400m freestyle            M  Gold  
## 4                        100m            M  Gold  
##       Total    Gold  Silver  Bronze
## USA  4335.0  2088.0  1195.0  1052.0
## URS  2049.0   838.0   627.0   584.0
## GBR  1594.0   498.0   591.0   505.0
## FRA  1314.0   378.0   461.0   475.0
## ITA  1228.0   460.0   394.0   374.0
## GER  1211.0   407.0   350.0   454.0
## AUS  1075.0   293.0   369.0   413.0
## HUN  1053.0   400.0   308.0   345.0
## SWE  1021.0   347.0   349.0   325.0
## GDR   825.0   329.0   271.0   225.0
## NED   782.0   212.0   250.0   320.0
## JPN   704.0   206.0   228.0   270.0
## CHN   679.0   234.0   252.0   193.0
## RUS   638.0   192.0   206.0   240.0
## ROU   624.0   155.0   187.0   282.0
## CAN   592.0   154.0   211.0   227.0
## NOR   537.0   194.0   199.0   144.0
## POL   499.0   103.0   173.0   223.0
## DEN   491.0   147.0   192.0   152.0
## FRG   490.0   143.0   167.0   180.0
##      Max TemperatureF
## Jan                68
## Feb                60
## Mar                68
## Apr                84
## May                88
##      Max TemperatureF
## Apr                84
## Aug                86
## Dec                68
## Feb                60
## Jan                68
##      Max TemperatureF
## Sep                90
## Oct                84
## Nov                72
## May                88
## Mar                68
##      Max TemperatureF
## Feb                60
## Jan                68
## Mar                68
## Dec                68
## Nov                72
##      Mean TemperatureF
## Apr          61.956044
## Jan          32.133333
## Jul          68.934783
## Oct          43.434783
##      Mean TemperatureF
## Jan          32.133333
## Feb                NaN
## Mar                NaN
## Apr          61.956044
## May                NaN
## Jun                NaN
## Jul          68.934783
## Aug                NaN
## Sep                NaN
## Oct          43.434783
## Nov                NaN
## Dec                NaN
##      Mean TemperatureF
## Jan          32.133333
## Feb          32.133333
## Mar          32.133333
## Apr          61.956044
## May          61.956044
## Jun          61.956044
## Jul          68.934783
## Aug          68.934783
## Sep          68.934783
## Oct          43.434783
## Nov          43.434783
## Dec          43.434783
## (1935, 2)
## (19471, 2)
##         Gender  Count
## Name                 
## John         M   8769
## William      M   8524
## Mary         F   6919
## James        M   5441
## George       M   4664
## Charles      M   4636
## Frank        M   2834
## Anna         F   2698
## Joseph       M   2456
## Henry        M   2339
## Thomas       M   2282
## Edward       M   2177
##             Gender  Count
## Name                     
## Michael          M  68765
## Jennifer         F  57046
## Christopher      M  50228
## Matthew          M  43324
## Jessica          F  42530
## Jason            M  41926
## David            M  40647
## Joshua           M  39054
## James            M  38307
## John             M  34881
## Robert           M  34396
## Amanda           F  34372
## (200, 2)
##         Gender    Count
## Name                   
## John         M  34881.0
## William      M  24803.0
## Mary         F  11040.0
## James        M  38307.0
## George       M   5159.0
## Charles      M  14428.0
## Frank        M   3637.0
## Anna         F   5189.0
## Joseph       M  30771.0
## Henry      NaN      NaN
## Thomas       M  17165.0
## Edward       M   6657.0
## (42, 2)
##         Gender    Count
## Name                   
## John         M  34881.0
## William      M  24803.0
## Mary         F  11040.0
## James        M  38307.0
## George       M   5159.0
## Charles      M  14428.0
## Frank        M   3637.0
## Anna         F   5189.0
## Joseph       M  30771.0
## Thomas       M  17165.0
## Edward       M   6657.0
## Robert       M  34396.0
##    Min_TemperatureC  Mean_TemperatureC  Max_TemperatureC
## 0         -6.111111          -2.777778          0.000000
## 1        -10.000000          -6.666667         -3.888889
## 2        -14.444444          -6.666667          0.555556
## 3         -3.333333          -1.666667          0.000000
## 4         -4.444444          -1.111111          1.666667
##                 GDP
## DATE               
## 2015-04-01  17998.3
## 2015-07-01  18141.9
## 2015-10-01  18222.8
## 2016-01-01  18281.6
## 2016-04-01  18450.1
## 2016-07-01  18675.3
## 2016-10-01  18869.4
## 2017-01-01  19027.6
##                 GDP
## DATE               
## 2008-12-31  14549.9
## 2009-12-31  14566.5
## 2010-12-31  15230.2
## 2011-12-31  15785.3
## 2012-12-31  16297.3
## 2013-12-31  16999.9
## 2014-12-31  17692.2
## 2015-12-31  18222.8
## 2016-12-31  18869.4
## 2017-12-31  19027.6
##                 GDP    growth
## DATE                         
## 2008-12-31  14549.9       NaN
## 2009-12-31  14566.5  0.114090
## 2010-12-31  15230.2  4.556345
## 2011-12-31  15785.3  3.644732
## 2012-12-31  16297.3  3.243524
## 2013-12-31  16999.9  4.311144
## 2014-12-31  17692.2  4.072377
## 2015-12-31  18222.8  2.999062
## 2016-12-31  18869.4  3.548302
## 2017-12-31  19027.6  0.838394

Chapter 2 - Concatenating Data

Appending and concatenating Series - using .append() or pd.concat():

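  • When invoked as DF1.append(DF2), the rows of DF2 will be placed beneath those of DF1 (note that .append() was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat() is the option that works on current versions)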
    • This method will also work with Series, in addition to DataFrames
    • The method .reset_index(drop=True) will create a new default integer index and discard the old index (drop=True is what prevents the old index from being kept as a column)
  • Alternatively, pd.concat([DF1, DF2, DF3]) can be used to concatenate the data (the objects are passed together as a single list)
    • This method can be run for rows (stacked data) or columns
    • The option ignore_index=True will create a new index for the concatenated data
  • The appended data may have duplicates in the index, which is permissible but frequently undesirable
    • The .reset_index(drop=True) or ignore_index=True are best practices for obtaining a unique index
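
As a minimal sketch of the index behavior described above, the two small Series below (s1 and s2, made-up values that are not part of the course data) are stacked with pd.concat(), first keeping the duplicated labels and then asking for a fresh index in the two ways mentioned:

```python
import pandas as pd

# Two small, made-up Series that both use the default 0, 1 index
s1 = pd.Series([10, 20], name="Units")
s2 = pd.Series([30, 40], name="Units")

# Plain stacking keeps the original labels, so 0 and 1 each appear twice
stacked = pd.concat([s1, s2])
print(stacked.index.tolist())

# Either option below gives a fresh 0..n-1 index
fresh_a = pd.concat([s1, s2], ignore_index=True)
fresh_b = pd.concat([s1, s2]).reset_index(drop=True)
print(fresh_a.index.tolist())
```

Both approaches should give the same result here; ignore_index=True simply avoids the extra method call.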

Appending and concatenating DataFrames:

  • If the data have different columns, the stacking still occurs, with np.nan filled in wherever a row's source DataFrame did not contain that column
  • If the data have different index names, the data are still stacked under each other, but the index becomes un-named
  • Passing the argument axis=1 or axis=“columns” to pd.concat() requests that the columns be placed to the right of the existing data, rather than the rows being placed underneath it
    • In this case, matching indices will lead to full data, while mismatched indices will fill with the appropriate amounts of np.nan
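
The difference between the two directions can be sketched with two tiny DataFrames (df1 and df2 are made-up names and values, assumed only for illustration) that share one index label and no columns:

```python
import pandas as pd

# Made-up frames: different columns, and only the label "y" in common
df1 = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
df2 = pd.DataFrame({"b": [3, 4]}, index=["y", "z"])

# Default (axis=0): rows are stacked and missing columns are filled with NaN
rows = pd.concat([df1, df2])
print(rows)

# axis=1: columns are placed side by side, aligned on the index labels
cols = pd.concat([df1, df2], axis=1)
print(cols)
```

Only the label "y" is present in both inputs, so it is the only row of cols without a missing value.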

Concatenation, keys, and MultiIndexes:

  • If using the keys=[] option inside pd.concat(), then an extra outer index will be created, with the items in keys corresponding to the DataFrames in the pd.concat() list
  • If concatenating using axis=1 / axis=“columns”, then there can be multiple columns with the same name
    • The keys=[] work-around works here also; with axis=1, the outer key is placed on the columns rather than on the rows
  • If a dictionary is sent as the input to pd.concat({}), then the dictionary keys become the outer keys
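
A short sketch of the three variants above, using two made-up monthly frames (jan and feb here are illustrative stand-ins, not the sales files used later):

```python
import pandas as pd

# Made-up monthly data with identical index labels and column names
jan = pd.DataFrame({"Units": [5, 7]}, index=["A", "B"])
feb = pd.DataFrame({"Units": [6, 2]}, index=["A", "B"])

# keys= on rows: the outer index level identifies the source frame
by_rows = pd.concat([jan, feb], keys=["jan", "feb"])
print(by_rows)

# keys= with axis=1: the outer level goes on the columns instead
by_cols = pd.concat([jan, feb], keys=["jan", "feb"], axis=1)
print(by_cols)

# A dictionary input: its keys become the outer index level
from_dict = pd.concat({"jan": jan, "feb": feb})
print(from_dict.loc[("feb", "A"), "Units"])
```

The order of the outer keys for dictionary input has differed between pandas versions, so selecting by label (as in the last line) is safer than relying on position.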

Outer and Inner Joins:

  • If using numpy, np.hstack() will stack horizontally and np.vstack() will stack vertically
    • Can instead use np.concatenate([], axis=0/1) where axis=0 is the vstack and axis=1 is the hstack
  • Joins are the process of combining rows of multiple tables in a meaningful manner
    • Outer joins are similar to the work above, where everything is kept with np.nan inserted as needed due to index mismatch
    • Inner joins keep only the rows where the indices are common to both tables
    • The option join=“inner” can be included inside the pd.concat() call (join=“outer” is the default and can be omitted)
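
The following is a minimal sketch of both ideas: the numpy stacking helpers on two small 2-D arrays, and an outer versus inner join inside pd.concat(). All names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# For 2-D arrays, vstack matches axis=0 and hstack matches axis=1
print(np.vstack([a, b]).shape)   # (4, 2)
print(np.hstack([a, b]).shape)   # (2, 4)
same_v = np.array_equal(np.vstack([a, b]), np.concatenate([a, b], axis=0))
same_h = np.array_equal(np.hstack([a, b]), np.concatenate([a, b], axis=1))

# Made-up medal counts: USA and GBR appear in both frames
left = pd.DataFrame({"gold": [1, 2, 3]}, index=["USA", "GBR", "ITA"])
right = pd.DataFrame({"silver": [4, 5, 6]}, index=["USA", "GBR", "FRA"])

# Outer join (default) keeps every label; inner keeps only the shared ones
outer = pd.concat([left, right], axis=1)
inner = pd.concat([left, right], axis=1, join="inner")
print(outer)
print(inner)
```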

Example code includes:


myPath = "./PythonInputFiles/"



import pandas as pd
import numpy as np
import random

# Do not have these .csv files
# Created dummy data and saved .csv to myPath
# keyDates = pd.date_range("2015-01-01", "2015-03-31")
# utHardware = [random.randint(2, 10) for p in range(len(keyDates))]
# utSoftware = [random.randint(1, 50) for p in range(len(keyDates))]
# utService = [random.randint(0, 200) for p in range(len(keyDates))]
# totSales = pd.DataFrame( {"Date":[str(x).split()[0] for x in keyDates], "Hardware":utHardware, "Software":utSoftware, "Service":utService } )
# totSales["Units"] = totSales["Hardware"] + totSales["Software"] + totSales["Service"]
# totSales["Company"] = ["A", "B", "C"] * 30
# totSales.iloc[:31, :].to_csv(myPath + "sales-jan-2015.csv", index=False)
# totSales.iloc[31:59, :].to_csv(myPath + "sales-feb-2015.csv", index=False)
# totSales.iloc[59:, :].to_csv(myPath + "sales-mar-2015.csv", index=False)


# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv(myPath + "sales-jan-2015.csv", parse_dates=True, index_col="Date")

# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv(myPath + "sales-feb-2015.csv", parse_dates=True, index_col="Date")

# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv(myPath + "sales-mar-2015.csv", parse_dates=True, index_col="Date")

# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']

# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']

# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = jan_units.append(feb_units).append(mar_units)

# Print the first slice from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])

# Print the second slice from quarter1
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

# Compute & print total sales in quarter1
print(quarter1.sum())


# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month["Units"])

# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis="rows")

# Print slices from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])


# Refers back to the names datasets from earlier in these chapters
yob1881 = pd.read_csv(myPath + "yob1881.txt", header=None)
yob1981 = pd.read_csv(myPath + "yob1981.txt", header=None)

yob1881.columns = ["Name", "Gender", "Count"]
yob1981.columns = ["Name", "Gender", "Count"]

names_1881 = yob1881.sort_values("Count", ascending=False)
names_1981 = yob1981.sort_values("Count", ascending=False)


# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981


# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981, ignore_index=True)

# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)

# Print all rows that contain the name 'Morgan'
print(combined_names.loc[combined_names["Name"].str.contains("Morgan"), :])


# These data are the monthly and quarterly weather data from above in this workbook (Max is the 12x1 monthly data, Mean is the 4x1 quarterly data)
# Each DataFrame has a single column ("Max TemperatureF" or "Mean TemperatureF"), indexed by month abbreviation
maxTemps = [68, 60, 68, 84, 88, 89, 91, 86, 90, 84, 72, 68]
maxIndex = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
meanTemps = [61.956043956043956, 32.133333333333333, 68.934782608695656, 43.434782608695649]
meanIndex = ["Apr", "Jan", "Jul", "Oct"]

weather_max = pd.DataFrame( {"Max TemperatureF":maxTemps}, index=maxIndex)
weather_mean = pd.DataFrame( {"Mean TemperatureF":meanTemps}, index=meanIndex)


# Concatenate weather_max and weather_mean horizontally: weather
weather = pd.concat([weather_max, weather_mean], axis=1).reindex(weather_max.index)

# Print weather
print(weather)


# This uses the Olympics medal datasets from previous

medal_types = ['bronze', 'silver', 'gold']
medals = []

for medal in medal_types:
    # Create the file name: file_name
    file_name = myPath + "%s_top5.csv" % medal  # Note that the %s followed later by % medal means to replace the %s with the value of medal
    
    # Create list of column names: columns
    columns = ['Country', medal]
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, header=0, index_col="Country", names=columns)
    
    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals horizontally: medals
medals = pd.concat(medals, axis="columns")

# Print medals
print(medals)


medals = []

for medal in medal_types:
    file_name = myPath + "%s_top5.csv" % medal
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, index_col="NOC")
    
    # Append medal_df to medals
    medals.append(medal_df)
    
# Concatenate medals: medals
medals = pd.concat(medals, keys=['bronze', 'silver', 'gold'])

# Print medals in entirety
print(medals)


# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)

# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','GER')])

# Print data about silver medals
print(medals_sorted.loc['silver'])

# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice

# Print all the data on medals won by the United Kingdom
print(medals_sorted.loc[idx[:,'GBR'], :])


# DO NOT HAVE THESE FILES - PROBABLY LINKED TO THE "sales" INPUTS FROM ABOVE
# Concatenate dataframes: february
# february = pd.concat(dataframes, axis=1, keys=['Hardware', 'Software', 'Service'])

# Print february.info()
# print(february.info())

# Assign pd.IndexSlice: idx
# idx = pd.IndexSlice

# Create the slice: slice_2_8
# slice_2_8 = february.loc['2015-02-02':'2015-02-08', idx[:, 'Company']]

# Print slice_2_8
# print(slice_2_8)


# CONTINUES TO BE jan/feb/mar FROM PREVIOUS "sales" INPUTS
# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]

# Create an empty dictionary: month_dict
month_dict = {}

for month_name, month_data in month_list:
    
    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby("Company").sum()

# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)

# Print sales
print(sales)

# Print all sales by 'A'
idx = pd.IndexSlice
print(sales.loc[idx[:, 'A'], :])


# Again, the Olympics datasets (specifically, top-5 by medal type)
bronze_top5=pd.read_csv(myPath + "bronze_top5.csv", index_col="NOC")
silver_top5=pd.read_csv(myPath + "silver_top5.csv", index_col="NOC")
gold_top5=pd.read_csv(myPath + "gold_top5.csv", index_col="NOC")

# Create the list of DataFrames: medal_list
medal_list = [bronze_top5, silver_top5, gold_top5]

# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, axis=1, join="inner", keys=['bronze', 'silver', 'gold'])
medals.columns = ['bronze', 'silver', 'gold']

# Print medals
print(medals)


# US is quarterly GDP starting 1947
# China is annual GDP starting 1966

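# Resample and tidy china: china_annual
# (.last() added as an assumption: recent pandas versions require an explicit aggregation after .resample())
# china_annual = china.resample("A").last().pct_change(10).dropna()

# Resample and tidy us: us_annual
# us_annual = us.resample("A").last().pct_change(10).dropna()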

# Concatenate china_annual and us_annual: gdp
# gdp = pd.concat([china_annual, us_annual], join="inner", axis=1)

# Resample gdp and print
# print(gdp.resample('10A').last())
## Date
## 2015-01-27    200
## 2015-01-28    223
## 2015-01-29    176
## 2015-01-30    124
## 2015-01-31    116
## 2015-02-01    116
## 2015-02-02    168
## Name: Units, dtype: int64
## Date
## 2015-02-26    234
## 2015-02-27    203
## 2015-02-28    118
## 2015-03-01    136
## 2015-03-02     31
## 2015-03-03    191
## 2015-03-04     80
## 2015-03-05     38
## 2015-03-06    111
## 2015-03-07    129
## Name: Units, dtype: int64
## 11979
## Date
## 2015-01-27    200
## 2015-01-28    223
## 2015-01-29    176
## 2015-01-30    124
## 2015-01-31    116
## 2015-02-01    116
## 2015-02-02    168
## Name: Units, dtype: int64
## Date
## 2015-02-26    234
## 2015-02-27    203
## 2015-02-28    118
## 2015-03-01    136
## 2015-03-02     31
## 2015-03-03    191
## 2015-03-04     80
## 2015-03-05     38
## 2015-03-06    111
## 2015-03-07    129
## Name: Units, dtype: int64
## (19471, 4)
## (1935, 4)
## (21406, 4)
##            Name Gender  Count  year
## 680      Morgan      M     23  1881
## 2249     Morgan      F   1769  1981
## 2521     Morgan      M    766  1981
## 10117   Morgana      F     14  1981
## 13078   Morgann      F      9  1981
## 19844  Morganne      F      5  1981
##      Max TemperatureF  Mean TemperatureF
## Jan                68          32.133333
## Feb                60                NaN
## Mar                68                NaN
## Apr                84          61.956044
## May                88                NaN
## Jun                89                NaN
## Jul                91          68.934783
## Aug                86                NaN
## Sep                90                NaN
## Oct                84          43.434783
## Nov                72                NaN
## Dec                68                NaN
##      bronze  silver    gold
## FRA   475.0   461.0     NaN
## GBR   505.0   591.0   498.0
## GER   454.0     NaN   407.0
## ITA     NaN   394.0   460.0
## URS   584.0   627.0   838.0
## USA  1052.0  1195.0  2088.0
##             Athlete
##        NOC         
## bronze USA     1052
##        URS      584
##        GBR      505
##        FRA      475
##        GER      454
## silver USA     1195
##        URS      627
##        GBR      591
##        FRA      461
##        ITA      394
## gold   USA     2088
##        URS      838
##        GBR      498
##        ITA      460
##        GER      407
## Athlete    454
## Name: (bronze, GER), dtype: int64
##      Athlete
## NOC         
## FRA      461
## GBR      591
## ITA      394
## URS      627
## USA     1195
##             Athlete
##        NOC         
## bronze GBR      505
## gold   GBR      498
## silver GBR      591
##                   Hardware  Service  Software  Units
##          Company                                    
## february A              47      986       210   1243
##          B              70     1092       242   1404
##          C              41      966       189   1196
## january  A              72     1133       252   1457
##          B              68     1117       188   1373
##          C              50     1037       277   1364
## march    A              66      667       247    980
##          B              56     1137       303   1496
##          C              65     1139       262   1466
##                   Hardware  Service  Software  Units
##          Company                                    
## february A              47      986       210   1243
## january  A              72     1133       252   1457
## march    A              66      667       247    980
##      bronze  silver  gold
## NOC                      
## USA    1052    1195  2088
## URS     584     627   838
## GBR     505     591   498

Chapter 3 - Merging Data

Merging DataFrames - an extension of concatenation that allows for merging on things other than the index:

  • Can use pd.merge(DF1, DF2) to merge on all columns whose names appear in both DataFrames, using an inner join by default
    • Adding on=[“”] will allow for merging to take place only on the specified column(s), with any other duplicated column names taking on _x and _y suffixes
    • Can add suffixes=[“”] to replace _x and _y with the specified suffixes for the new variable names
    • Can instead specify left_on=[“”] and right_on=[“”] to specify that differently named columns in the first and second DataFrame should be used for the merge

Joining DataFrames - various types of joins, and implications on processing efficiency:

  • The default for pd.merge() is an implied how=“inner” argument
    • The how=“left” option will keep everything from the left dataset and only the matches from the right (non-matched data will be null-filled)
    • The how=“right” option will keep everything from the right dataset and only the matches from the left (non-matched data will be null-filled)
    • The how=“outer” will keep everything from either dataset
  • When using myDF.join(DF2), there is a default how=“left” assumption such that everything in myDF will be kept, along with matching data from DF2
    • This can be over-ridden by specifying the how= as “right” or “inner” or “outer”
  • Suggestions for data-combining techniques
    • df1.append(df2) works fine for simple stacking vertically
    • pd.concat([df1, df2]) adds flexibility, including the ability to stack horizontally and inner/outer joins
    • df1.join(df2) expands to allow left/right joins in addition to inner/outer
    • pd.merge(df1, df2) adds the customization of multiple columns, mismatched column names, and the like
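
As a rough illustration of how the how= argument changes the result, the sketch below counts the rows produced by each join type on two made-up frames, and then shows the left-join default of .join() (which aligns on the index):

```python
import pandas as pd

# Made-up frames: keys "b" and "c" are shared, "a" and "d" are not
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Row counts for each join type
counts = {how: len(pd.merge(left, right, on="key", how=how))
          for how in ["inner", "left", "right", "outer"]}
print(counts)

# .join() works on the index and keeps every row of the calling frame
joined = left.set_index("key").join(right.set_index("key"))
print(joined)
```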

Ordered merges - DataFrames where the underlying data has a natural order (such as time series data):

  • The pd.merge_ordered() call defaults to an outer join, with the result sorted on the key column(s) used for the merge
    • Can specify on=[“”] to define the columns to be merged
    • Can specify fill_method=“ffill” to forward-fill on any np.nan that would otherwise be generated
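
A small sketch of the forward-fill option, assuming two made-up time series (the column names price and volume are hypothetical) whose dates only partly overlap:

```python
import pandas as pd

a = pd.DataFrame({"date": pd.to_datetime(["2016-01-01", "2016-03-01"]),
                  "price": [1.0, 3.0]})
b = pd.DataFrame({"date": pd.to_datetime(["2016-02-01", "2016-03-01"]),
                  "volume": [20, 30]})

# Outer join ordered by date; unmatched cells are NaN
plain = pd.merge_ordered(a, b, on="date")
print(plain)

# Same merge, but gaps are filled from the previous row
filled = pd.merge_ordered(a, b, on="date", fill_method="ffill")
print(filled)
```

Forward-filling cannot fill a gap in the very first row, since there is no earlier value to carry forward, so the first volume stays missing.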

Example code includes:


myPath = "./PythonInputFiles/"


import pandas as pd


revenue = pd.DataFrame({"branch_id" : [10, 20, 30, 47] , "city" : ["Austin", "Denver", "Springfield", "Mendocino"] , "revenue" : [100, 83, 4, 200] } )
managers = pd.DataFrame({"branch_id" : [10, 20, 47, 31] , "city" : ["Austin", "Denver", "Mendocino", "Springfield"] , "manager" : ["Charles", "Joel", "Brett", "Sally"] } )


# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue, managers, on="city")

# Print merge_by_city
print(merge_by_city)

# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue, managers, on="branch_id")

# Print merge_by_id
print(merge_by_id)


revenue["state"] = ["TX", "CO", "IL", "CA"]
managers["state"] = ["TX", "CO", "CA", "MO"]

managers=managers.iloc[:, [1, 0, 2, 3]]
managers.columns = ["branch", "branch_id", "manager", "state"]

# Merge revenue & managers on 'city' & 'branch': combined
combined = pd.merge(revenue, managers, left_on="city", right_on="branch")

# Print combined
print(combined)


# Add 'state' column to revenue: revenue['state']
# revenue['state'] = ['TX','CO','IL','CA']  # already handled above

# Add 'state' column to managers: managers['state']
# managers['state'] = ['TX','CO','CA','MO']  # already handled above


managers = managers.iloc[:, [1, 0, 2, 3]]   # get back to how it was
managers.columns = ["branch_id", "city", "manager", "state"]

# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue, managers, on=["branch_id", "city", "state"])

# Print combined
print(combined)


sales = pd.DataFrame( { "city" : ["Mendocino", "Denver", "Austin", "Springield", "Springfield"] , "state" : ["CA", "CO", "TX", "MO", "IL"] , "units" : [1, 4, 2, 5, 1] } )
managers=managers.iloc[:, [1, 0, 2, 3]]
managers.columns = ["branch", "branch_id", "manager", "state"]


# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue, sales, how="right", on=['city', 'state'])

# Print revenue_and_sales
print(revenue_and_sales)

# Merge sales and managers: sales_and_managers
sales_and_managers = pd.merge(sales, managers, how="left", left_on=['city', 'state'], right_on=['branch', 'state'])

# Print sales_and_managers
print(sales_and_managers)


# Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers, revenue_and_sales)

# Print merge_default
print(merge_default)

# Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers, revenue_and_sales, how="outer")

# Print merge_outer
print(merge_outer)

# Perform the third merge: merge_outer_on
merge_outer_on = pd.merge(sales_and_managers, revenue_and_sales, on=['city','state'], how="outer")

# Print merge_outer_on
print(merge_outer_on)


austin = pd.DataFrame( { "date":pd.to_datetime(["2016-01-01", "2016-02-08", "2016-01-17"]), "ratings" : ["Cloudy", "Cloudy", "Sunny"] } )
houston = pd.DataFrame( { "date":pd.to_datetime(["2016-01-04", "2016-01-01", "2016-03-01"]), "ratings" : ["Rainy", "Cloudy", "Sunny"] } )

# Perform the first ordered merge: tx_weather
tx_weather = pd.merge_ordered(austin, houston)

# Print tx_weather
print(tx_weather)

# Perform the second ordered merge: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin, houston, on="date", suffixes=['_aus','_hus'])

# Print tx_weather_suff
print(tx_weather_suff)

# Perform the third ordered merge: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin, houston, on="date", suffixes=['_aus','_hus'], fill_method="ffill")

# Print tx_weather_ffill
print(tx_weather_ffill)


# Similar to pd.merge_ordered(), the pd.merge_asof() function merges in order using the 'on' column, but it is a left join on the nearest key: by default, each row in the left DataFrame is matched with the last row from the right DataFrame whose 'on' value is less than or equal to the left value.

# DO NOT HAVE THESE DATASETS
# Merge auto and oil: merged
# merged = pd.merge_asof(auto, oil, left_on="yr", right_on="Date")

# Print the tail of merged
# print(merged.tail())

# Resample merged: yearly
# yearly = merged.resample("A", on="Date")[['mpg','Price']].mean()

# Print yearly
# print(yearly)

# Print the correlation matrix of yearly
# print(yearly.corr())
##    branch_id_x         city  revenue  branch_id_y  manager
## 0           10       Austin      100           10  Charles
## 1           20       Denver       83           20     Joel
## 2           30  Springfield        4           31    Sally
## 3           47    Mendocino      200           47    Brett
##    branch_id     city_x  revenue     city_y  manager
## 0         10     Austin      100     Austin  Charles
## 1         20     Denver       83     Denver     Joel
## 2         47  Mendocino      200  Mendocino    Brett
##    branch_id_x         city  revenue state_x       branch  branch_id_y  \
## 0           10       Austin      100      TX       Austin           10   
## 1           20       Denver       83      CO       Denver           20   
## 2           30  Springfield        4      IL  Springfield           31   
## 3           47    Mendocino      200      CA    Mendocino           47   
## 
##    manager state_y  
## 0  Charles      TX  
## 1     Joel      CO  
## 2    Sally      MO  
## 3    Brett      CA  
##    branch_id       city  revenue state  manager
## 0         10     Austin      100    TX  Charles
## 1         20     Denver       83    CO     Joel
## 2         47  Mendocino      200    CA    Brett
##    branch_id         city  revenue state  units
## 0       10.0       Austin    100.0    TX      2
## 1       20.0       Denver     83.0    CO      4
## 2       30.0  Springfield      4.0    IL      1
## 3       47.0    Mendocino    200.0    CA      1
## 4        NaN   Springield      NaN    MO      5
##           city state  units     branch  branch_id  manager
## 0    Mendocino    CA      1  Mendocino       47.0    Brett
## 1       Denver    CO      4     Denver       20.0     Joel
## 2       Austin    TX      2     Austin       10.0  Charles
## 3   Springield    MO      5        NaN        NaN      NaN
## 4  Springfield    IL      1        NaN        NaN      NaN
##          city state  units     branch  branch_id  manager  revenue
## 0   Mendocino    CA      1  Mendocino       47.0    Brett    200.0
## 1      Denver    CO      4     Denver       20.0     Joel     83.0
## 2      Austin    TX      2     Austin       10.0  Charles    100.0
## 3  Springield    MO      5        NaN        NaN      NaN      NaN
##           city state  units     branch  branch_id  manager  revenue
## 0    Mendocino    CA      1  Mendocino       47.0    Brett    200.0
## 1       Denver    CO      4     Denver       20.0     Joel     83.0
## 2       Austin    TX      2     Austin       10.0  Charles    100.0
## 3   Springield    MO      5        NaN        NaN      NaN      NaN
## 4  Springfield    IL      1        NaN        NaN      NaN      NaN
## 5  Springfield    IL      1        NaN       30.0      NaN      4.0
##           city state  units_x     branch  branch_id_x  manager  branch_id_y  \
## 0    Mendocino    CA        1  Mendocino         47.0    Brett         47.0   
## 1       Denver    CO        4     Denver         20.0     Joel         20.0   
## 2       Austin    TX        2     Austin         10.0  Charles         10.0   
## 3   Springield    MO        5        NaN          NaN      NaN          NaN   
## 4  Springfield    IL        1        NaN          NaN      NaN         30.0   
## 
##    revenue  units_y  
## 0    200.0        1  
## 1     83.0        4  
## 2    100.0        2  
## 3      NaN        5  
## 4      4.0        1  
##         date ratings
## 0 2016-01-01  Cloudy
## 1 2016-01-04   Rainy
## 2 2016-01-17   Sunny
## 3 2016-02-08  Cloudy
## 4 2016-03-01   Sunny
##         date ratings_aus ratings_hus
## 0 2016-01-01      Cloudy      Cloudy
## 1 2016-01-04         NaN       Rainy
## 2 2016-01-17       Sunny         NaN
## 3 2016-02-08      Cloudy         NaN
## 4 2016-03-01         NaN       Sunny
##         date ratings_aus ratings_hus
## 0 2016-01-01      Cloudy      Cloudy
## 1 2016-01-04      Cloudy       Rainy
## 2 2016-01-17       Sunny       Rainy
## 3 2016-02-08      Cloudy       Rainy
## 4 2016-03-01      Cloudy       Sunny

Chapter 4 - Case Study (Summer Olympics)

Medals in the Summer Olympics - does a country win more medals when it is the host?:

  • Load and combine underlying .csv files from the Guardian

Quantifying Performance:

  • Using a .pivot_table(index=, values=, columns=, aggfunc=) to define “success” for each country’s athletes
  • Need to calculate fractions (percentage of total medals), and potentially zero-fill the NA data
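
The counting and fraction steps might look roughly like the sketch below. The five medal rows are made up, and the totals are computed from the counts themselves, which is an assumption made for this illustration (the case study takes totals from the hosts file instead):

```python
import pandas as pd

# Made-up medal records: one row per medal awarded
medals = pd.DataFrame({
    "Edition": [1896, 1896, 1896, 1900, 1900],
    "NOC": ["GRE", "USA", "USA", "FRA", "USA"],
    "Athlete": ["a", "b", "c", "d", "e"],
})

# Count medals per edition (rows) and country (columns)
counts = medals.pivot_table(index="Edition", columns="NOC",
                            values="Athlete", aggfunc="count")

# Total medals per edition, then each country's share of that total
totals = counts.sum(axis=1)
fractions = counts.divide(totals, axis="index").fillna(0)
print(fractions)
```

Countries with no medals in an edition appear as NaN in the pivot table, which is why the zero-fill is applied after the division.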

Reshaping and plotting:

  • Melting the data to be easier to work with
  • Merging in the host country information
  • Quantifying “home country” influence, and then plotting the findings
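
A compact sketch of the melt-then-merge pattern, using made-up percentage changes for two countries. The var_name="NOC" argument is needed here only because this toy frame has no name on its columns:

```python
import pandas as pd

# Made-up wide data: one row per edition, one column per country
wide = pd.DataFrame({"Edition": [2004, 2008],
                     "CHN": [1.5, 13.2],
                     "GRE": [7.0, -2.0]})

# Melt to long form: one row per (Edition, NOC) pair
long = pd.melt(wide, id_vars="Edition", var_name="NOC", value_name="Change")
print(long.shape)

# Host country for each edition
hosts = pd.DataFrame({"Edition": [2004, 2008], "NOC": ["GRE", "CHN"]})

# Inner merge on the shared columns keeps only the host-country rows
host_rows = pd.merge(long, hosts, how="inner")
print(host_rows.sort_values("Edition"))
```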

Example code includes:


myPath = "./PythonInputFiles/"



import pandas as pd
import matplotlib.pyplot as plt


# Create files needed for reading in later
# medals = pd.read_csv(myPath + "summerOlympics_Medalists_1896_2008.csv", header=4)
# uqYears = medals["Edition"].value_counts().sort_index().index
# for x in uqYears: 
#     outFile = myPath + '_notuse_summer_{:d}.csv'.format(x)
#     outData = medals.loc[medals["Edition"] == x]
#     outData.to_csv(outFile, index=False)
# 

# Create file path: file_path
file_path = myPath + "summerOlympics_Hosts_1896_2008.txt"

# Load DataFrame from file_path: editions
editions = pd.read_csv(file_path, sep="\t")

# Extract the relevant columns: editions
editions = editions[['Edition', 'Grand Total', 'City', 'Country']]

# Print editions DataFrame
print(editions)


# Create the file path: file_path
file_path = myPath + 'olympicsCountryCodes.csv'

# Load DataFrame from file_path: ioc_codes
ioc_codes = pd.read_csv(file_path)
ioc_codes.columns = ["Country", "NOC", "ISO", "Country_1"]

# Extract the relevant columns: ioc_codes
ioc_codes = ioc_codes[["Country", "NOC"]]

# Print first and last 5 rows of ioc_codes
print(ioc_codes.head())
print(ioc_codes.tail())


# Create empty dictionary: medals_dict
medals_dict = {}

for year in editions['Edition']:
    
    # Create the file path: file_path
    file_path = myPath + '_notuse_summer_{:d}.csv'.format(year)
    
    # Load file_path into a DataFrame: medals_dict[year]
    medals_dict[year] = pd.read_csv(file_path, encoding="latin-1")
    
    # Extract relevant columns: medals_dict[year]
    medals_dict[year] = medals_dict[year][['Athlete', 'NOC', 'Medal']]
    
    # Assign year to column 'Edition' of medals_dict[year]
    medals_dict[year]['Edition'] = year


# Concatenate medals_dict: medals
medals = pd.concat(medals_dict, ignore_index=True)

# Print first and last 5 rows of medals
print(medals.head())
print(medals.tail())


# Construct the pivot_table: medal_counts
medal_counts = medals.pivot_table(index="Edition", columns="NOC", values="Athlete", aggfunc="count")

# Print the first & last 5 rows of medal_counts
print(medal_counts.head())
print(medal_counts.tail())


# Set Index of editions: totals
totals = editions.set_index("Edition")

# Reassign totals['Grand Total']: totals
totals = totals["Grand Total"]

# Divide medal_counts by totals: fractions
fractions = medal_counts.divide(totals, axis="rows")

# Print first & last 5 rows of fractions
print(fractions.head())
print(fractions.tail())


# .expanding().mean() computes a cumulative mean: each row becomes the mean of all rows up to and including that edition (NaN values are skipped)
# Apply the expanding mean: mean_fractions
mean_fractions = fractions.expanding().mean()

# Compute the percentage change: fractions_change
fractions_change = mean_fractions.pct_change() * 100

# Reset the index of fractions_change: fractions_change
fractions_change = fractions_change.reset_index()

# Print first & last 5 rows of fractions_change
print(fractions_change.head())
print(fractions_change.tail())


# Left join editions and ioc_codes: hosts
hosts = pd.merge(editions, ioc_codes, how="left")

# Extract relevant columns and set index: hosts
hosts = hosts[["Edition", "NOC"]].set_index("Edition")

# Fix missing 'NOC' values of hosts
print(hosts.loc[hosts.NOC.isnull()])
hosts.loc[1972, 'NOC'] = 'FRG'
hosts.loc[1980, 'NOC'] = 'URS'
hosts.loc[1988, 'NOC'] = 'KOR'

# Reset Index of hosts: hosts
hosts = hosts.reset_index()

# Print hosts
print(hosts)


# Reshape fractions_change: reshaped
reshaped = pd.melt(fractions_change, id_vars="Edition", value_name="Change")

# Print reshaped.shape and fractions_change.shape
print(reshaped.shape, fractions_change.shape)

# Extract rows from reshaped where 'NOC' == 'CHN': chn
chn = reshaped[reshaped["NOC"] == "CHN"]

# Print last 5 rows of chn with .tail()
print(chn.tail())


# Merge reshaped and hosts: merged
merged = pd.merge(reshaped, hosts, how="inner")

# Print first 5 rows of merged
print(merged.head())

# Set Index of merged and sort it: influence
influence = merged.set_index("Edition").sort_index()

# Print first 5 rows of influence
print(influence.head())


# Import pyplot
import matplotlib.pyplot as plt

# Extract influence['Change']: change
change = influence["Change"]

# Make bar plot of change: ax
ax = change.plot(kind="bar")

# Customize the plot to improve readability
ax.set_ylabel("% Change of Host Country Medal Count")
ax.set_title("Is there a Host Country Advantage?")
ax.set_xticklabels(editions['City'])

# Display the plot
# plt.show()
plt.savefig("_dummyPy073.png", bbox_inches="tight")
plt.clf()
##     Edition  Grand Total         City                     Country
## 0      1896          151       Athens                      Greece
## 1      1900          512        Paris                      France
## 2      1904          470    St. Louis               United States
## 3      1908          804       London              United Kingdom
## 4      1912          885    Stockholm                      Sweden
## 5      1920         1298      Antwerp                     Belgium
## 6      1924          884        Paris                      France
## 7      1928          710    Amsterdam                 Netherlands
## 8      1932          615  Los Angeles               United States
## 9      1936          875       Berlin                     Germany
## 10     1948          814       London              United Kingdom
## 11     1952          889     Helsinki                     Finland
## 12     1956          885    Melbourne                   Australia
## 13     1960          882         Rome                       Italy
## 14     1964         1010        Tokyo                       Japan
## 15     1968         1031  Mexico City                      Mexico
## 16     1972         1185       Munich  West Germany (now Germany)
## 17     1976         1305     Montreal                      Canada
## 18     1980         1387       Moscow       U.S.S.R. (now Russia)
## 19     1984         1459  Los Angeles               United States
## 20     1988         1546        Seoul                 South Korea
## 21     1992         1705    Barcelona                       Spain
## 22     1996         1859      Atlanta               United States
## 23     2000         2015       Sydney                   Australia
## 24     2004         1998       Athens                      Greece
## 25     2008         2042      Beijing                       China
##            Country  NOC
## 0      Afghanistan  AFG
## 1          Albania  ALB
## 2          Algeria  ALG
## 3  American Samoa*  ASA
## 4          Andorra  AND
##              Country  NOC
## 196          Vietnam  VIE
## 197  Virgin Islands*  ISV
## 198            Yemen  YEM
## 199           Zambia  ZAM
## 200         Zimbabwe  ZIM
##               Athlete  NOC   Medal  Edition
## 0       HAJOS, Alfred  HUN    Gold     1896
## 1    HERSCHMANN, Otto  AUT  Silver     1896
## 2   DRIVAS, Dimitrios  GRE  Bronze     1896
## 3  MALOKINIS, Ioannis  GRE    Gold     1896
## 4  CHASAPIS, Spiridon  GRE  Silver     1896
##                     Athlete  NOC   Medal  Edition
## 29211        ENGLICH, Mirko  GER  Silver     2008
## 29212  MIZGAITIS, Mindaugas  LTU  Bronze     2008
## 29213       PATRIKEEV, Yuri  ARM  Bronze     2008
## 29214         LOPEZ, Mijain  CUB    Gold     2008
## 29215        BAROEV, Khasan  RUS  Silver     2008
## NOC      AFG  AHO  ALG   ANZ  ARG  ARM  AUS   AUT  AZE  BAH  ...   URS  URU  \
## Edition                                                      ...              
## 1896     NaN  NaN  NaN   NaN  NaN  NaN  2.0   5.0  NaN  NaN  ...   NaN  NaN   
## 1900     NaN  NaN  NaN   NaN  NaN  NaN  5.0   6.0  NaN  NaN  ...   NaN  NaN   
## 1904     NaN  NaN  NaN   NaN  NaN  NaN  NaN   1.0  NaN  NaN  ...   NaN  NaN   
## 1908     NaN  NaN  NaN  19.0  NaN  NaN  NaN   1.0  NaN  NaN  ...   NaN  NaN   
## 1912     NaN  NaN  NaN  10.0  NaN  NaN  NaN  14.0  NaN  NaN  ...   NaN  NaN   
## 
## NOC        USA  UZB  VEN  VIE  YUG  ZAM  ZIM   ZZX  
## Edition                                             
## 1896      20.0  NaN  NaN  NaN  NaN  NaN  NaN   6.0  
## 1900      55.0  NaN  NaN  NaN  NaN  NaN  NaN  34.0  
## 1904     394.0  NaN  NaN  NaN  NaN  NaN  NaN   8.0  
## 1908      63.0  NaN  NaN  NaN  NaN  NaN  NaN   NaN  
## 1912     101.0  NaN  NaN  NaN  NaN  NaN  NaN   NaN  
## 
## [5 rows x 138 columns]
## NOC      AFG  AHO  ALG  ANZ   ARG  ARM    AUS  AUT  AZE  BAH ...   URS  URU  \
## Edition                                                      ...              
## 1992     NaN  NaN  2.0  NaN   2.0  NaN   57.0  6.0  NaN  1.0 ...   NaN  NaN   
## 1996     NaN  NaN  3.0  NaN  20.0  2.0  132.0  3.0  1.0  5.0 ...   NaN  NaN   
## 2000     NaN  NaN  5.0  NaN  20.0  1.0  183.0  4.0  3.0  6.0 ...   NaN  1.0   
## 2004     NaN  NaN  NaN  NaN  47.0  NaN  157.0  8.0  5.0  2.0 ...   NaN  NaN   
## 2008     1.0  NaN  2.0  NaN  51.0  6.0  149.0  3.0  7.0  5.0 ...   NaN  NaN   
## 
## NOC        USA  UZB  VEN  VIE   YUG  ZAM  ZIM  ZZX  
## Edition                                             
## 1992     224.0  NaN  NaN  NaN   NaN  NaN  NaN  NaN  
## 1996     260.0  2.0  NaN  NaN  26.0  1.0  NaN  NaN  
## 2000     248.0  4.0  NaN  1.0  26.0  NaN  NaN  NaN  
## 2004     264.0  5.0  2.0  NaN   NaN  NaN  3.0  NaN  
## 2008     315.0  6.0  1.0  1.0   NaN  NaN  4.0  NaN  
## 
## [5 rows x 138 columns]
## NOC      AFG  AHO  ALG       ANZ  ARG  ARM       AUS       AUT  AZE  BAH  \
## Edition                                                                    
## 1896     NaN  NaN  NaN       NaN  NaN  NaN  0.013245  0.033113  NaN  NaN   
## 1900     NaN  NaN  NaN       NaN  NaN  NaN  0.009766  0.011719  NaN  NaN   
## 1904     NaN  NaN  NaN       NaN  NaN  NaN       NaN  0.002128  NaN  NaN   
## 1908     NaN  NaN  NaN  0.023632  NaN  NaN       NaN  0.001244  NaN  NaN   
## 1912     NaN  NaN  NaN  0.011299  NaN  NaN       NaN  0.015819  NaN  NaN   
## 
## NOC        ...     URS  URU       USA  UZB  VEN  VIE  YUG  ZAM  ZIM       ZZX  
## Edition    ...                                                                 
## 1896       ...     NaN  NaN  0.132450  NaN  NaN  NaN  NaN  NaN  NaN  0.039735  
## 1900       ...     NaN  NaN  0.107422  NaN  NaN  NaN  NaN  NaN  NaN  0.066406  
## 1904       ...     NaN  NaN  0.838298  NaN  NaN  NaN  NaN  NaN  NaN  0.017021  
## 1908       ...     NaN  NaN  0.078358  NaN  NaN  NaN  NaN  NaN  NaN       NaN  
## 1912       ...     NaN  NaN  0.114124  NaN  NaN  NaN  NaN  NaN  NaN       NaN  
## 
## [5 rows x 138 columns]
## NOC          AFG  AHO       ALG  ANZ       ARG       ARM       AUS       AUT  \
## Edition                                                                        
## 1992         NaN  NaN  0.001173  NaN  0.001173       NaN  0.033431  0.003519   
## 1996         NaN  NaN  0.001614  NaN  0.010758  0.001076  0.071006  0.001614   
## 2000         NaN  NaN  0.002481  NaN  0.009926  0.000496  0.090819  0.001985   
## 2004         NaN  NaN       NaN  NaN  0.023524       NaN  0.078579  0.004004   
## 2008     0.00049  NaN  0.000979  NaN  0.024976  0.002938  0.072968  0.001469   
## 
## NOC           AZE       BAH ...   URS       URU       USA       UZB       VEN  \
## Edition                     ...                                                 
## 1992          NaN  0.000587 ...   NaN       NaN  0.131378       NaN       NaN   
## 1996     0.000538  0.002690 ...   NaN       NaN  0.139860  0.001076       NaN   
## 2000     0.001489  0.002978 ...   NaN  0.000496  0.123077  0.001985       NaN   
## 2004     0.002503  0.001001 ...   NaN       NaN  0.132132  0.002503  0.001001   
## 2008     0.003428  0.002449 ...   NaN       NaN  0.154261  0.002938  0.000490   
## 
## NOC           VIE       YUG       ZAM       ZIM  ZZX  
## Edition                                               
## 1992          NaN       NaN       NaN       NaN  NaN  
## 1996          NaN  0.013986  0.000538       NaN  NaN  
## 2000     0.000496  0.012903       NaN       NaN  NaN  
## 2004          NaN       NaN       NaN  0.001502  NaN  
## 2008     0.000490       NaN       NaN  0.001959  NaN  
## 
## [5 rows x 138 columns]
## NOC  Edition  AFG  AHO  ALG        ANZ  ARG  ARM        AUS        AUT  AZE  \
## 0       1896  NaN  NaN  NaN        NaN  NaN  NaN        NaN        NaN  NaN   
## 1       1900  NaN  NaN  NaN        NaN  NaN  NaN -13.134766 -32.304688  NaN   
## 2       1904  NaN  NaN  NaN        NaN  NaN  NaN   0.000000 -30.169386  NaN   
## 3       1908  NaN  NaN  NaN        NaN  NaN  NaN   0.000000 -23.013510  NaN   
## 4       1912  NaN  NaN  NaN -26.092774  NaN  NaN   0.000000   6.254438  NaN   
## 
## NOC    ...      URS  URU         USA  UZB  VEN  VIE  YUG  ZAM  ZIM        ZZX  
## 0      ...      NaN  NaN         NaN  NaN  NaN  NaN  NaN  NaN  NaN        NaN  
## 1      ...      NaN  NaN   -9.448242  NaN  NaN  NaN  NaN  NaN  NaN  33.561198  
## 2      ...      NaN  NaN  199.651245  NaN  NaN  NaN  NaN  NaN  NaN -22.642384  
## 3      ...      NaN  NaN  -19.549222  NaN  NaN  NaN  NaN  NaN  NaN   0.000000  
## 4      ...      NaN  NaN  -12.105733  NaN  NaN  NaN  NaN  NaN  NaN   0.000000  
## 
## [5 rows x 139 columns]
## NOC  Edition  AFG  AHO        ALG  ANZ       ARG        ARM        AUS  \
## 21      1992  NaN  0.0  -7.214076  0.0 -6.767308        NaN   2.754114   
## 22      1996  NaN  0.0   8.959211  0.0  1.306696        NaN  10.743275   
## 23      2000  NaN  0.0  19.762488  0.0  0.515190 -26.935484  12.554986   
## 24      2004  NaN  0.0   0.000000  0.0  9.625365   0.000000   8.161162   
## 25      2008  NaN  0.0  -8.197807  0.0  8.588555  91.266408   6.086870   
## 
## NOC       AUT        AZE ...   URS        URU       USA        UZB       VEN  \
## 21  -3.034840        NaN ...   0.0   0.000000 -1.329330        NaN  0.000000   
## 22  -3.876773        NaN ...   0.0   0.000000 -1.010378        NaN  0.000000   
## 23  -3.464221  88.387097 ...   0.0 -12.025323 -1.341842  42.258065  0.000000   
## 24  -2.186922  48.982144 ...   0.0   0.000000 -1.031922  21.170339 -1.615969   
## 25  -3.389836  31.764436 ...   0.0   0.000000 -0.450031  14.610625 -6.987342   
## 
## NOC       VIE       YUG        ZAM        ZIM  ZZX  
## 21        NaN  0.000000   0.000000   0.000000  0.0  
## 22        NaN -2.667732 -10.758472   0.000000  0.0  
## 23        NaN -2.696445   0.000000   0.000000  0.0  
## 24   0.000000  0.000000   0.000000 -43.491929  0.0  
## 25  -0.661117  0.000000   0.000000 -23.316533  0.0  
## 
## [5 rows x 139 columns]
##          NOC
## Edition     
## 1972     NaN
## 1980     NaN
## 1988     NaN
##     Edition  NOC
## 0      1896  GRE
## 1      1900  FRA
## 2      1904  USA
## 3      1908  GBR
## 4      1912  SWE
## 5      1920  BEL
## 6      1924  FRA
## 7      1928  NED
## 8      1932  USA
## 9      1936  GER
## 10     1948  GBR
## 11     1952  FIN
## 12     1956  AUS
## 13     1960  ITA
## 14     1964  JPN
## 15     1968  MEX
## 16     1972  FRG
## 17     1976  CAN
## 18     1980  URS
## 19     1984  USA
## 20     1988  KOR
## 21     1992  ESP
## 22     1996  USA
## 23     2000  AUS
## 24     2004  GRE
## 25     2008  CHN
## (3588, 3) (26, 139)
##      Edition  NOC     Change
## 567     1992  CHN   4.240630
## 568     1996  CHN   7.860247
## 569     2000  CHN  -3.851278
## 570     2004  CHN   0.128863
## 571     2008  CHN  13.251332
##    Edition  NOC     Change
## 0     1956  AUS  54.615063
## 1     2000  AUS  12.554986
## 2     1920  BEL  54.757887
## 3     1976  CAN  -2.143977
## 4     2008  CHN  13.251332
##          NOC      Change
## Edition                 
## 1896     GRE         NaN
## 1900     FRA  198.002486
## 1904     USA  199.651245
## 1908     GBR  134.489218
## 1912     SWE   71.896226
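
The expanding mean used in the case study above can be illustrated on a small series. This is a minimal sketch with made-up fractions (the values and the missing edition are assumptions, not Olympic data); it shows why an edition with no medals produces a 0.0 percentage change rather than NaN.

```python
import pandas as pd

# Hypothetical medal fractions for one country; 1904 is missing on purpose
fractions = pd.Series([0.10, 0.20, float("nan"), 0.30],
                      index=[1896, 1900, 1904, 1908])

# Expanding mean: at each edition, the mean of all non-missing values so far
mean_fractions = fractions.expanding().mean()
print(mean_fractions)

# Percentage change of the running mean from one edition to the next
fractions_change = mean_fractions.pct_change() * 100
print(fractions_change)
```

The running mean is carried forward unchanged through the missing edition, so the percentage change for that edition is zero.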

Summer Olympics - % Change in Medals (Host Country):

Introduction to Databases in Python

Chapter 1 - Basics of Relational Databases

Introduction to Databases - relational tables that store data (course features US Census data):

  • Columns are the named fields of a table, and each column must hold a single, consistent data type
  • Tables can be joined on common fields (even when the field names differ) - this is the basis of the “relational model”
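
A join on common fields with different names can be sketched using the standard-library sqlite3 module. The table contents and the state_fact column names (name, abbreviation) are hypothetical, chosen only to show the idea.

```python
import sqlite3

# In-memory database with two small, made-up tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE census (state TEXT, pop2008 INTEGER);
CREATE TABLE state_fact (name TEXT, abbreviation TEXT);
INSERT INTO census VALUES ('Texas', 100), ('Ohio', 50);
INSERT INTO state_fact VALUES ('Texas', 'TX'), ('Ohio', 'OH');
""")

# The common field is called 'state' in one table and 'name' in the other
rows = conn.execute("""
SELECT state_fact.abbreviation, census.pop2008
FROM census
JOIN state_fact ON census.state = state_fact.name
ORDER BY state_fact.abbreviation
""").fetchall()
print(rows)
conn.close()
```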

Connecting to Your Database - tools in SQLAlchemy, which allows for writing SQL code using Python:

  • Core Model (Relational) will be the focus of this course
  • ORM (User Data Model) is an additional capability of SQLAlchemy
  • The key advantage of SQLAlchemy is the ability to work across database types (SQLite, PostgreSQL, MySQL, etc.)
    • from sqlalchemy import create_engine
    • engine = create_engine(“sqlite:///[myFile].sqlite”) to create the engine, which is the common interface to the database from SQLAlchemy
    • connection = engine.connect()
  • The connection string (such as “sqlite:///census_nyc.sqlite”) names the database type (sqlite) followed by the path to the file (census_nyc.sqlite, which is in the ./ directory in this example)
    • print(engine.table_names()) will return the table names in the relevant file (this method is deprecated in SQLAlchemy 1.4 and removed in 2.0, where inspect(engine).get_table_names() is used instead)
  • Reflection is a technique for reading the database and building the SQLAlchemy tables
    • from sqlalchemy import MetaData, Table
    • metadata = MetaData()
    • census = Table(“census”, metadata, autoload=True, autoload_with=engine)
    • print(repr(census)) # will show the column names and data types
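
The connection and reflection steps above can be sketched end to end as follows. This assumes an in-memory SQLite database and a hypothetical census table created on the spot (the course uses a file instead), and it uses inspect() and text(), which also work on newer SQLAlchemy releases.

```python
from sqlalchemy import MetaData, Table, create_engine, inspect, text

# In-memory SQLite database, so no file is needed (assumption for this sketch)
engine = create_engine("sqlite://")

with engine.connect() as connection:
    # Hypothetical table loosely modelled on the course's census table
    connection.execute(text(
        "CREATE TABLE census "
        "(state VARCHAR(30), sex VARCHAR(1), age INTEGER, pop2008 INTEGER)"
    ))

    # List the tables visible through this connection
    table_names = inspect(connection).get_table_names()
    print(table_names)

    # Reflection: build the Table object by reading the database schema
    metadata = MetaData()
    census = Table("census", metadata, autoload_with=connection)
    print(census.columns.keys())
```

Passing autoload_with on its own is enough to trigger reflection; the autoload=True flag in the notes is the older spelling.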

Introduction to SQL - basic commands:

  • SELECT column_name FROM table_name to select the specified column from the specified table (if column_name is * it means “all”)
    • Can store the statement in a variable, such as stmt = “SELECT * FROM people” ; newVar_proxy = connection.execute(stmt) ; newVar = newVar_proxy.fetchall()
    • The “newVar_proxy” is of type “ResultProxy”, and the rows retrieved from it, such as by .fetchall(), are the “ResultSet”
  • SQLAlchemy allows for a Pythonic way to build complex SQL statements
    • After reflecting the table (such as census in the above block), the most basic statement is stmt = select([census]), which is the SQLAlchemy equivalent of SELECT * FROM census
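
The two ways of issuing a query described above can be compared side by side. This sketch assumes SQLAlchemy 1.4 or later, where raw SQL strings are wrapped in text() and select(people) replaces the list form select([people]) used in these notes; the people table and its rows are made up.

```python
from sqlalchemy import MetaData, Table, create_engine, select, text

engine = create_engine("sqlite://")  # in-memory database (assumption)

with engine.connect() as connection:
    # Hypothetical table with two rows
    connection.execute(text("CREATE TABLE people (name TEXT, age INTEGER)"))
    connection.execute(text("INSERT INTO people VALUES ('Ann', 34), ('Bob', 27)"))

    # Option 1: a raw SQL string
    raw_rows = connection.execute(text("SELECT * FROM people")).fetchall()

    # Option 2: a statement built from the reflected table
    people = Table("people", MetaData(), autoload_with=connection)
    stmt = select(people)
    print(stmt)  # shows the SQL that SQLAlchemy will emit
    built_rows = connection.execute(stmt).fetchall()

print(raw_rows)
print(built_rows)
```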

Example code includes:


myPath = "./PythonInputFiles/"



import pandas as pd


# Appears that the SQL file has two tables, "census" and "state_fact"
# Downloaded a different version of the file from: 
# https://www.gfairchild.com/2011/12/13/2010-census-sqlite-database/
# This data contains ['counties', 'states', 'states_zctas', 'zctas']


# Import create_engine
from sqlalchemy import create_engine

# Create an engine that connects to the census.sqlite file: engine
engine = create_engine("sqlite:///" + myPath + "2010CensusPopulation.db")

# Print table names
print(engine.table_names())


from sqlalchemy import MetaData
metadata = MetaData()  # Created here; it appears to be pre-loaded in the DataCamp exercises

# Import Table
from sqlalchemy import Table

# Reflect census table from the engine: census (uses states instead . . . )
# census = Table("census", metadata, autoload=True, autoload_with=engine)
census = Table("states", metadata, autoload=True, autoload_with=engine)

# Print census table metadata
print(repr(census))

# Output in DataCamp example is: Table('census', MetaData(bind=None), Column('state', VARCHAR(length=30), table=<census>), Column('sex', VARCHAR(length=1), table=<census>), Column('age', INTEGER(), table=<census>), Column('pop2000', INTEGER(), table=<census>), Column('pop2008', INTEGER(), table=<census>), schema=None)
# MANY more columns using the data I have

# Reflect the census table from the engine: census (per previous, using 'states' instead)
census = Table("states", metadata, autoload=True, autoload_with=engine)

# Print the column names
print(census.columns.keys())

# Print full table metadata (per previous, using 'states' instead)
print(repr(metadata.tables["states"]))


# Build select statement for census table: stmt
# stmt = "SELECT * FROM census"
stmt = "SELECT * FROM states"

# Execute the statement and fetch the results: results
connection = engine.connect()  # Create a connection to the engine defined above
results = connection.execute(stmt).fetchall()

# Print Results (too long to print the entire thing)
# print(results)
print(type(results))
print(len(results))
print(results[0])


# Import select
from sqlalchemy import select

# Reflect census table via engine: census (per previous, use states instead)
# census = Table('census', metadata, autoload=True, autoload_with=engine)
census = Table('states', metadata, autoload=True, autoload_with=engine)

# Build select statement for census table: stmt
stmt = select([census])

# Print the emitted statement to see the SQL emitted
print(stmt)

# Execute the statement and print the results (WAY TOO LONG!)
# print(connection.execute(stmt).fetchall())


# Get the first row of the results by using an index: first_row
first_row = results[0]

# Print the first row of the results
print(first_row)

# Print the first column of the first row by using an index
print(first_row[0])

# Print the 'state' column of the first row by using its name
print(first_row["state"])



# Make it a sensible DataFrame
myDF = pd.DataFrame(results)
myDF.columns = census.columns.keys()
print(myDF.shape)

# Melt the data down so that gender and age are the columns
# Key by id-state
# Drop total population, gender subtotals, and centroids
colNamesNo = ["centroid_longitude", "centroid_latitude", "population_total", "population_male_total", "population_female_total"]
colNumsNo = [list(myDF.columns).index(x) for x in colNamesNo]

myBasic = myDF.iloc[:, [0, 1] + colNumsNo]  # [0, 1] are id-state
myPreMelt = myDF.iloc[:, [a not in colNumsNo for a in range(len(myDF.columns))]]

myMelt = myPreMelt.melt(id_vars=["id", "state"], var_name="gender_age", value_name="pop2010")
myMelt["gender"] = [x.split("_")[1] for x in myMelt["gender_age"]]
myMelt["age"] = [x.split("_")[2] for x in myMelt["gender_age"]]

print(myMelt.shape)
print(myMelt.head(10))
print(myMelt.tail(10))
print(myMelt["gender"].value_counts())
print(myMelt["age"].value_counts())
print(myMelt.info())
## ['counties', 'states', 'states_zctas', 'zctas']
## Table('states', MetaData(bind=None), Column('id', INTEGER(), table=<states>, primary_key=True, nullable=False), Column('state', TEXT(), table=<states>, nullable=False), Column('centroid_longitude', REAL(), table=<states>, nullable=False), Column('centroid_latitude', REAL(), table=<states>, nullable=False), Column('population_total', INTEGER(), table=<states>, nullable=False), Column('population_male_total', INTEGER(), table=<states>, nullable=False), Column('population_male_lt5', INTEGER(), table=<states>, nullable=False), Column('population_male_5to9', INTEGER(), table=<states>, nullable=False), Column('population_male_10to14', INTEGER(), table=<states>, nullable=False), Column('population_male_15to17', INTEGER(), table=<states>, nullable=False), Column('population_male_18to19', INTEGER(), table=<states>, nullable=False), Column('population_male_20', INTEGER(), table=<states>, nullable=False), Column('population_male_21', INTEGER(), table=<states>, nullable=False), Column('population_male_22to24', INTEGER(), table=<states>, nullable=False), Column('population_male_25to29', INTEGER(), table=<states>, nullable=False), Column('population_male_30to34', INTEGER(), table=<states>, nullable=False), Column('population_male_35to39', INTEGER(), table=<states>, nullable=False), Column('population_male_40to44', INTEGER(), table=<states>, nullable=False), Column('population_male_45to49', INTEGER(), table=<states>, nullable=False), Column('population_male_50to54', INTEGER(), table=<states>, nullable=False), Column('population_male_55to59', INTEGER(), table=<states>, nullable=False), Column('population_male_60to61', INTEGER(), table=<states>, nullable=False), Column('population_male_62to64', INTEGER(), table=<states>, nullable=False), Column('population_male_65to66', INTEGER(), table=<states>, nullable=False), Column('population_male_67to69', INTEGER(), table=<states>, nullable=False), Column('population_male_70to74', INTEGER(), table=<states>, nullable=False), 
Column('population_male_75to79', INTEGER(), table=<states>, nullable=False), Column('population_male_80to84', INTEGER(), table=<states>, nullable=False), Column('population_male_ge85', INTEGER(), table=<states>, nullable=False), Column('population_female_total', INTEGER(), table=<states>, nullable=False), Column('population_female_lt5', INTEGER(), table=<states>, nullable=False), Column('population_female_5to9', INTEGER(), table=<states>, nullable=False), Column('population_female_10to14', INTEGER(), table=<states>, nullable=False), Column('population_female_15to17', INTEGER(), table=<states>, nullable=False), Column('population_female_18to19', INTEGER(), table=<states>, nullable=False), Column('population_female_20', INTEGER(), table=<states>, nullable=False), Column('population_female_21', INTEGER(), table=<states>, nullable=False), Column('population_female_22to24', INTEGER(), table=<states>, nullable=False), Column('population_female_25to29', INTEGER(), table=<states>, nullable=False), Column('population_female_30to34', INTEGER(), table=<states>, nullable=False), Column('population_female_35to39', INTEGER(), table=<states>, nullable=False), Column('population_female_40to44', INTEGER(), table=<states>, nullable=False), Column('population_female_45to49', INTEGER(), table=<states>, nullable=False), Column('population_female_50to54', INTEGER(), table=<states>, nullable=False), Column('population_female_55to59', INTEGER(), table=<states>, nullable=False), Column('population_female_60to61', INTEGER(), table=<states>, nullable=False), Column('population_female_62to64', INTEGER(), table=<states>, nullable=False), Column('population_female_65to66', INTEGER(), table=<states>, nullable=False), Column('population_female_67to69', INTEGER(), table=<states>, nullable=False), Column('population_female_70to74', INTEGER(), table=<states>, nullable=False), Column('population_female_75to79', INTEGER(), table=<states>, nullable=False), Column('population_female_80to84', INTEGER(), 
table=<states>, nullable=False), Column('population_female_ge85', INTEGER(), table=<states>, nullable=False), schema=None)
## ['id', 'state', 'centroid_longitude', 'centroid_latitude', 'population_total', 'population_male_total', 'population_male_lt5', 'population_male_5to9', 'population_male_10to14', 'population_male_15to17', 'population_male_18to19', 'population_male_20', 'population_male_21', 'population_male_22to24', 'population_male_25to29', 'population_male_30to34', 'population_male_35to39', 'population_male_40to44', 'population_male_45to49', 'population_male_50to54', 'population_male_55to59', 'population_male_60to61', 'population_male_62to64', 'population_male_65to66', 'population_male_67to69', 'population_male_70to74', 'population_male_75to79', 'population_male_80to84', 'population_male_ge85', 'population_female_total', 'population_female_lt5', 'population_female_5to9', 'population_female_10to14', 'population_female_15to17', 'population_female_18to19', 'population_female_20', 'population_female_21', 'population_female_22to24', 'population_female_25to29', 'population_female_30to34', 'population_female_35to39', 'population_female_40to44', 'population_female_45to49', 'population_female_50to54', 'population_female_55to59', 'population_female_60to61', 'population_female_62to64', 'population_female_65to66', 'population_female_67to69', 'population_female_70to74', 'population_female_75to79', 'population_female_80to84', 'population_female_ge85']
## Table('states', MetaData(bind=None), Column('id', INTEGER(), table=<states>, primary_key=True, nullable=False), Column('state', TEXT(), table=<states>, nullable=False), Column('centroid_longitude', REAL(), table=<states>, nullable=False), Column('centroid_latitude', REAL(), table=<states>, nullable=False), Column('population_total', INTEGER(), table=<states>, nullable=False), Column('population_male_total', INTEGER(), table=<states>, nullable=False), Column('population_male_lt5', INTEGER(), table=<states>, nullable=False), Column('population_male_5to9', INTEGER(), table=<states>, nullable=False), Column('population_male_10to14', INTEGER(), table=<states>, nullable=False), Column('population_male_15to17', INTEGER(), table=<states>, nullable=False), Column('population_male_18to19', INTEGER(), table=<states>, nullable=False), Column('population_male_20', INTEGER(), table=<states>, nullable=False), Column('population_male_21', INTEGER(), table=<states>, nullable=False), Column('population_male_22to24', INTEGER(), table=<states>, nullable=False), Column('population_male_25to29', INTEGER(), table=<states>, nullable=False), Column('population_male_30to34', INTEGER(), table=<states>, nullable=False), Column('population_male_35to39', INTEGER(), table=<states>, nullable=False), Column('population_male_40to44', INTEGER(), table=<states>, nullable=False), Column('population_male_45to49', INTEGER(), table=<states>, nullable=False), Column('population_male_50to54', INTEGER(), table=<states>, nullable=False), Column('population_male_55to59', INTEGER(), table=<states>, nullable=False), Column('population_male_60to61', INTEGER(), table=<states>, nullable=False), Column('population_male_62to64', INTEGER(), table=<states>, nullable=False), Column('population_male_65to66', INTEGER(), table=<states>, nullable=False), Column('population_male_67to69', INTEGER(), table=<states>, nullable=False), Column('population_male_70to74', INTEGER(), table=<states>, nullable=False), 
Column('population_male_75to79', INTEGER(), table=<states>, nullable=False), Column('population_male_80to84', INTEGER(), table=<states>, nullable=False), Column('population_male_ge85', INTEGER(), table=<states>, nullable=False), Column('population_female_total', INTEGER(), table=<states>, nullable=False), Column('population_female_lt5', INTEGER(), table=<states>, nullable=False), Column('population_female_5to9', INTEGER(), table=<states>, nullable=False), Column('population_female_10to14', INTEGER(), table=<states>, nullable=False), Column('population_female_15to17', INTEGER(), table=<states>, nullable=False), Column('population_female_18to19', INTEGER(), table=<states>, nullable=False), Column('population_female_20', INTEGER(), table=<states>, nullable=False), Column('population_female_21', INTEGER(), table=<states>, nullable=False), Column('population_female_22to24', INTEGER(), table=<states>, nullable=False), Column('population_female_25to29', INTEGER(), table=<states>, nullable=False), Column('population_female_30to34', INTEGER(), table=<states>, nullable=False), Column('population_female_35to39', INTEGER(), table=<states>, nullable=False), Column('population_female_40to44', INTEGER(), table=<states>, nullable=False), Column('population_female_45to49', INTEGER(), table=<states>, nullable=False), Column('population_female_50to54', INTEGER(), table=<states>, nullable=False), Column('population_female_55to59', INTEGER(), table=<states>, nullable=False), Column('population_female_60to61', INTEGER(), table=<states>, nullable=False), Column('population_female_62to64', INTEGER(), table=<states>, nullable=False), Column('population_female_65to66', INTEGER(), table=<states>, nullable=False), Column('population_female_67to69', INTEGER(), table=<states>, nullable=False), Column('population_female_70to74', INTEGER(), table=<states>, nullable=False), Column('population_female_75to79', INTEGER(), table=<states>, nullable=False), Column('population_female_80to84', INTEGER(), 
table=<states>, nullable=False), Column('population_female_ge85', INTEGER(), table=<states>, nullable=False), schema=None)
## <class 'list'>
## 52
## (1, 'Wyoming', -107.5419255, 42.9918024, 563626, 287437, 20596, 19203, 18592, 11385, 8241, 4406, 4211, 12698, 21752, 18919, 17702, 17149, 19713, 22450, 20928, 7338, 9540, 5058, 6497, 8126, 5704, 4176, 3053, 276189, 19607, 18010, 17363, 10646, 7870, 3971, 3763, 11269, 19524, 17454, 16159, 15956, 19759, 21655, 20018, 6785, 8904, 4976, 6443, 8468, 6788, 5252, 5549)
## SELECT states.id, states.state, states.centroid_longitude, states.centroid_latitude, states.population_total, states.population_male_total, states.population_male_lt5, states.population_male_5to9, states.population_male_10to14, states.population_male_15to17, states.population_male_18to19, states.population_male_20, states.population_male_21, states.population_male_22to24, states.population_male_25to29, states.population_male_30to34, states.population_male_35to39, states.population_male_40to44, states.population_male_45to49, states.population_male_50to54, states.population_male_55to59, states.population_male_60to61, states.population_male_62to64, states.population_male_65to66, states.population_male_67to69, states.population_male_70to74, states.population_male_75to79, states.population_male_80to84, states.population_male_ge85, states.population_female_total, states.population_female_lt5, states.population_female_5to9, states.population_female_10to14, states.population_female_15to17, states.population_female_18to19, states.population_female_20, states.population_female_21, states.population_female_22to24, states.population_female_25to29, states.population_female_30to34, states.population_female_35to39, states.population_female_40to44, states.population_female_45to49, states.population_female_50to54, states.population_female_55to59, states.population_female_60to61, states.population_female_62to64, states.population_female_65to66, states.population_female_67to69, states.population_female_70to74, states.population_female_75to79, states.population_female_80to84, states.population_female_ge85 
## FROM states
## (1, 'Wyoming', -107.5419255, 42.9918024, 563626, 287437, 20596, 19203, 18592, 11385, 8241, 4406, 4211, 12698, 21752, 18919, 17702, 17149, 19713, 22450, 20928, 7338, 9540, 5058, 6497, 8126, 5704, 4176, 3053, 276189, 19607, 18010, 17363, 10646, 7870, 3971, 3763, 11269, 19524, 17454, 16159, 15956, 19759, 21655, 20018, 6785, 8904, 4976, 6443, 8468, 6788, 5252, 5549)
## 1
## Wyoming
## (52, 53)
## (2392, 6)
##    id         state           gender_age  pop2010 gender  age
## 0   1       Wyoming  population_male_lt5    20596   male  lt5
## 1   2  Pennsylvania  population_male_lt5   373216   male  lt5
## 2   3          Ohio  population_male_lt5   367479   male  lt5
## 3   4    New Mexico  population_male_lt5    74078   male  lt5
## 4   5      Maryland  population_male_lt5   185916   male  lt5
## 5   6  Rhode Island  population_male_lt5    29396   male  lt5
## 6   7        Oregon  population_male_lt5   121828   male  lt5
## 7   8   Puerto Rico  population_male_lt5   115173   male  lt5
## 8   9     Wisconsin  population_male_lt5   183391   male  lt5
## 9  10  North Dakota  population_male_lt5    22821   male  lt5
##       id                 state              gender_age  pop2010  gender   age
## 2382  43                  Iowa  population_female_ge85    51307  female  ge85
## 2383  44               Arizona  population_female_ge85    65662  female  ge85
## 2384  45             Minnesota  population_female_ge85    72357  female  ge85
## 2385  46             Louisiana  population_female_ge85    44789  female  ge85
## 2386  47  District of Columbia  population_female_ge85     7198  female  ge85
## 2387  48              Virginia  population_female_ge85    83957  female  ge85
## 2388  49                 Texas  population_female_ge85   204208  female  ge85
## 2389  50               Vermont  population_female_ge85     8694  female  ge85
## 2390  51                 Maine  population_female_ge85    19797  female  ge85
## 2391  52        North Carolina  population_female_ge85   103205  female  ge85
## female    1196
## male      1196
## Name: gender, dtype: int64
## 10to14    104
## 50to54    104
## 62to64    104
## 15to17    104
## 55to59    104
## 20        104
## 70to74    104
## 65to66    104
## 60to61    104
## 21        104
## ge85      104
## 30to34    104
## 80to84    104
## 22to24    104
## lt5       104
## 35to39    104
## 67to69    104
## 40to44    104
## 45to49    104
## 75to79    104
## 25to29    104
## 18to19    104
## 5to9      104
## Name: age, dtype: int64
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2392 entries, 0 to 2391
## Data columns (total 6 columns):
## id            2392 non-null int64
## state         2392 non-null object
## gender_age    2392 non-null object
## pop2010       2392 non-null int64
## gender        2392 non-null object
## age           2392 non-null object
## dtypes: int64(2), object(4)
## memory usage: 74.8+ KB
## None

Chapter 2 - Applying Filtering, Ordering, etc.

Filtering and Targeting Data - select subsets of records based on specified criteria:

  • In SQL, this would be run using WHERE, for example SELECT * FROM census WHERE state = 'California'
  • Using SQL Alchemy, this is a two-line process with stmt = select([census]) ; stmt = stmt.where(census.columns.state == “California”)
    • results = connection.execute(stmt).fetchall()
  • There are additional expressions to add flexibility to the query statements
    • in_(), like(), between() - these are available as methods on the column objects
    • stmt = stmt.where(census.columns.state.startswith(“New”)) will pull back the states that start with “New”
    • and_(), or_(), and not_() are also available to allow for boolean operations - these can be nested, though that is not covered in this class
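
The filtering expressions above can be tried out with a minimal sketch like the one below. It is not the course data: it assumes a small in-memory SQLite table with made-up rows, and it assumes SQLAlchemy 1.4 or later, where select() takes its columns positionally rather than as the list used elsewhere in these notes.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        and_, create_engine, insert, not_, or_, select)

# Hypothetical stand-in for the census table (made-up rows, in-memory SQLite)
engine = create_engine("sqlite://")
metadata = MetaData()
census = Table(
    "census", metadata,
    Column("state", String(30)),
    Column("gender", String(6)),
    Column("pop2010", Integer()),
)
metadata.create_all(engine)

sample = [
    {"state": "New York", "gender": "male", "pop2010": 100},
    {"state": "New York", "gender": "female", "pop2010": 110},
    {"state": "New Mexico", "gender": "male", "pop2010": 20},
    {"state": "California", "gender": "female", "pop2010": 200},
    {"state": "Texas", "gender": "male", "pop2010": 150},
]
with engine.begin() as conn:
    conn.execute(insert(census), sample)

col = census.columns
with engine.connect() as conn:
    # WHERE state IN ('California', 'Texas')
    stmt = select(col.state).where(col.state.in_(["California", "Texas"]))
    in_states = sorted(row.state for row in conn.execute(stmt))

    # WHERE state LIKE 'New%'
    stmt = select(col.state).where(col.state.startswith("New")).distinct()
    new_states = sorted(row.state for row in conn.execute(stmt))

    # WHERE pop2010 BETWEEN 100 AND 150 (both ends inclusive)
    stmt = select(col.pop2010).where(col.pop2010.between(100, 150))
    mid_pops = sorted(row.pop2010 for row in conn.execute(stmt))

    # Nested boolean logic:
    # (state = 'New York' AND gender != 'male') OR NOT (pop2010 < 200)
    stmt = select(col.state, col.gender).where(
        or_(
            and_(col.state == "New York", col.gender != "male"),
            not_(col.pop2010 < 200),
        )
    )
    combined = sorted((row.state, row.gender) for row in conn.execute(stmt))

print(in_states, new_states, mid_pops, combined)
```

The last query shows that and_(), or_(), and not_() can be nested inside each other, even though nesting is not covered in the course.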

Overview of Ordering - equivalent of the ORDER BY clause of SQL:

  • Can be achieved in SQL Alchemy using stmt.order_by(myTable.columns.myColumn) # ascending sort by default
  • A descending sort wraps the column in desc(), as in stmt.order_by(desc(myTable.columns.myColumn))
  • Can pass multiple columns, such as stmt.order_by(myTable.columns.myColumn1, desc(myTable.columns.myColumn2))
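
As a small illustration of mixed ascending and descending sorts, the sketch below orders made-up rows by state (ascending) and then by population (descending). The table contents are hypothetical, and the positional select() form assumes SQLAlchemy 1.4 or later.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, desc, insert, select)

# Hypothetical in-memory table with made-up populations
engine = create_engine("sqlite://")
metadata = MetaData()
census = Table(
    "census", metadata,
    Column("state", String(30)),
    Column("pop2010", Integer()),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(census), [
        {"state": "Texas", "pop2010": 150},
        {"state": "Alabama", "pop2010": 40},
        {"state": "Texas", "pop2010": 170},
        {"state": "Alabama", "pop2010": 60},
    ])

# ORDER BY state ASC, pop2010 DESC
stmt = select(census.columns.state, census.columns.pop2010)
stmt = stmt.order_by(census.columns.state, desc(census.columns.pop2010))

with engine.connect() as conn:
    ordered = [tuple(row) for row in conn.execute(stmt)]

print(ordered)
```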

Counting, Summing, and Grouping Data - much more efficient to run these using SQL rather than to grab all the data and run these in Python:

  • Aggregation functions collapse many records into one - sums or counts for example
  • There is a two-step process to access sum: 1) from sqlalchemy import func, followed by 2) using func.sum() inside the relevant code
    • Importing sum directly into the namespace would be bad, as it would then conflict with sum() from base Python
  • There is also a group_by() method on the statement for building GROUP BY clauses
  • SQL Alchemy auto-generates “column names” for functions in the ResultSet, such as count_1 or sum_2
    • Can instead append .label(“myLabel”) to the desired calculation, and then “myLabel” will replace sum_2
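
The aggregation pieces above (func, group_by(), and label()) fit together roughly as in the sketch below. The rows are made up, the database is an in-memory SQLite stand-in, and the positional select() form assumes SQLAlchemy 1.4 or later.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, func, insert, select)

# Hypothetical in-memory table with made-up populations
engine = create_engine("sqlite://")
metadata = MetaData()
census = Table(
    "census", metadata,
    Column("state", String(30)),
    Column("pop2010", Integer()),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(census), [
        {"state": "New York", "pop2010": 100},
        {"state": "New York", "pop2010": 110},
        {"state": "Texas", "pop2010": 150},
    ])

# SUM(pop2010) AS population, grouped by state
pop_sum = func.sum(census.columns.pop2010).label("population")
sum_stmt = (
    select(census.columns.state, pop_sum)
    .group_by(census.columns.state)
    .order_by(census.columns.state)
)

# COUNT(DISTINCT state) collapses the whole table to one number
count_stmt = select(func.count(census.columns.state.distinct()))

with engine.connect() as conn:
    result = conn.execute(sum_stmt)
    keys = list(result.keys())          # the label replaces the auto-generated name
    totals = [tuple(row) for row in result]
    distinct_states = conn.execute(count_stmt).scalar()

print(keys, totals, distinct_states)
```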

Visualize Data using pandas and matplotlib:

  • Can create the DataFrame using df=pd.DataFrame(results), followed by df.columns = results[0].keys()

Example code includes:


myPath = "./PythonInputFiles/"



import pandas as pd


# Import create_engine function
from sqlalchemy import create_engine, MetaData, Table, select

# Create an engine to the census database
# engine = create_engine('postgresql+psycopg2://' + 'student:datacamp' + '@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com' + ':5432/census')

# Created dummy data with real state-gender-age-pop2010 and totally fake pop2000 = (0.90, 1.05) * pop2010
engine = create_engine("sqlite:///" + myPath + "PartialFakeCensusExample.db")

# Use the .table_names() method on the engine to print the table names
print(engine.table_names())

# Create a select query: stmt
metadata = MetaData()
census = Table("census", metadata, autoload=True, autoload_with=engine)  # make sure this is set up
stmt = select([census])

# Add a where clause to filter the results to only those for New York
stmt = stmt.where(census.columns["state"] == "New York")

# Execute the statement and fetch all of the results: results
connection = engine.connect()  # Create a connection to the engine defined above
results = connection.execute(stmt).fetchall()

# Loop over the results and print the age, gender, and pop2010 (sex and pop2008 in the course data)
for result in results:
    print(result.age, result.gender, result.pop2010)


states = ['New York', 'California', 'Texas']

# Create a query for the census table: stmt
stmt = select([census])

# Append a where clause to match all the states in_ the list states
stmt = stmt.where(census.columns.state.in_(states))

# Loop over the ResultProxy and print the state and its population in 2000
for x in connection.execute(stmt):
    print(x.state, x.pop2000)


# Import and_
from sqlalchemy import and_

# Build a query for the census table: stmt
stmt = select([census])

# Append a where clause to select only non-male records from California using and_
stmt = stmt.where(
    # The state of California with a non-male sex
    and_(census.columns.state == "California",
         census.columns.gender != "male"
         )
)

# Loop over the ResultProxy printing the age and sex
for result in connection.execute(stmt):
    print(result.age, result.gender)


# Build a query to select the state column: stmt
stmt = select([census.columns.state])

# Order stmt by the state column
stmt = stmt.order_by(census.columns.state)

# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()

# Print the first 10 results
print(results[:10])


# Import desc
from sqlalchemy import desc

# Build a query to select the state column: stmt
stmt = select([census.columns.state])

# Order stmt by state in descending order: rev_stmt
rev_stmt = stmt.order_by(desc(census.columns.state))

# Execute the query and store the results: rev_results
rev_results = connection.execute(rev_stmt).fetchall()

# Print the first 10 rev_results
print(rev_results[:10])


# Build a query to select state and age: stmt
stmt = select([census.columns.state, census.columns.age])

# Append order by to ascend by state and descend by age
stmt = stmt.order_by(census.columns.state, desc(census.columns.age))

# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()

# Print the first 20 results
print(results[:20])


from sqlalchemy import func

# Build a query to count the distinct states values: stmt
stmt = select([func.count(census.columns.state.distinct())])

# Execute the query and store the scalar result: distinct_state_count
distinct_state_count = connection.execute(stmt).scalar()

# Print the distinct_state_count
print(distinct_state_count)


# Import func
from sqlalchemy import func

# Build a query to select the state and count of ages by state: stmt
stmt = select([census.columns.state, func.count(census.columns.age)])

# Group stmt by state
stmt = stmt.group_by(census.columns.state)

# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()

# Print results
print(results)

# Print the keys/column names of the results returned
print(results[0].keys())


# Import func
from sqlalchemy import func

# Build an expression to calculate the sum of pop2010 labeled as population
pop2010_sum = func.sum(census.columns.pop2010).label("population")

# Build a query to select the state and sum of pop2010: stmt
stmt = select([census.columns.state, pop2010_sum])

# Group stmt by state
stmt = stmt.group_by(census.columns.state)

# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()

# Print results
print(results)

# Print the keys/column names of the results returned
print(results[0].keys())


# import pandas
import pandas as pd

# Create a DataFrame from the results: df
df = pd.DataFrame(results)

# Set column names
df.columns = results[0].keys()

# Print the Dataframe
print(df)


# Import Pyplot as plt from matplotlib
import matplotlib.pyplot as plt

# Plot the DataFrame
df.sort_values("population", ascending=False).set_index("state").plot.bar()
# plt.show()
plt.savefig("_dummyPy074.png", bbox_inches="tight")
plt.clf()
## ['census', 'state_fact']
## lt5 male 590879
## 5to9 male 594362
## 10to14 male 619243
## 15to17 male 406797
## 18to19 male 292751
## 20 male 149840
## 21 male 143298
## 22to24 male 418864
## 25to29 male 680203
## 30to34 male 629759
## 35to39 male 613775
## 40to44 male 663333
## 45to49 male 709523
## 50to54 male 687779
## 55to59 male 591847
## 60to61 male 214047
## 62to64 male 286312
## 65to66 male 151551
## 67to69 male 200704
## 70to74 male 258616
## 75to79 male 200049
## 80to84 male 150993
## ge85 male 122622
## lt5 female 564943
## 5to9 female 569593
## 10to14 female 592213
## 15to17 female 386899
## 18to19 female 279831
## 20 female 143243
## 21 female 138298
## 22to24 female 417392
## 25to29 female 699974
## 30to34 female 649401
## 35to39 female 640349
## 40to44 female 692560
## 45to49 female 749240
## 50to54 female 732149
## 55to59 female 645561
## 60to61 female 239946
## 62to64 female 325955
## 65to66 female 178609
## 67to69 female 242347
## 70to74 female 328775
## 75to79 female 274758
## 80to84 female 240667
## ge85 female 268252
## New York 567834
## California 1261704
## Texas 969386
## New York 569398
## California 1287240
## Texas 950364
## New York 601284
## California 1332544
## Texas 955163
## New York 424289
## California 880198
## Texas 602017
## New York 297435
## California 542407
## Texas 351896
## New York 145494
## California 282228
## Texas 199048
## New York 145590
## California 282458
## Texas 190191
## New York 427241
## California 815489
## Texas 504550
## New York 688365
## California 1411515
## Texas 861970
## New York 625980
## California 1246955
## Texas 912905
## New York 642622
## California 1203556
## Texas 889281
## New York 648739
## California 1234523
## Texas 803674
## New York 744289
## California 1366139
## Texas 814497
## New York 655453
## California 1305805
## Texas 767493
## New York 543907
## California 1114914
## Texas 630442
## New York 208909
## California 345910
## Texas 242777
## New York 297764
## California 474567
## Texas 298426
## New York 141397
## California 281418
## Texas 168120
## New York 209334
## California 355970
## Texas 229053
## New York 251116
## California 423892
## Texas 257465
## New York 200849
## California 337174
## Texas 212283
## New York 142990
## California 230677
## Texas 141253
## New York 121273
## California 220723
## Texas 103495
## New York 526526
## California 1143243
## Texas 960377
## New York 528012
## California 1210334
## Texas 957641
## New York 567340
## California 1228329
## Texas 908907
## New York 403535
## California 845514
## Texas 523162
## New York 254926
## California 533823
## Texas 333994
## New York 132643
## California 268662
## Texas 174983
## New York 125297
## California 258666
## Texas 166167
## New York 376070
## California 754520
## Texas 486828
## New York 655175
## California 1378453
## Texas 906760
## New York 636412
## California 1305925
## Texas 860873
## New York 668524
## California 1205276
## Texas 827988
## New York 632999
## California 1343752
## Texas 867432
## New York 696043
## California 1401748
## Texas 836895
## New York 705059
## California 1169000
## Texas 838466
## New York 583587
## California 1175155
## Texas 678970
## New York 235147
## California 424561
## Texas 236343
## New York 339971
## California 509927
## Texas 317358
## New York 172357
## California 284615
## Texas 195515
## New York 240165
## California 386112
## Texas 265604
## New York 317596
## California 488747
## Texas 339314
## New York 281901
## California 444839
## Texas 268446
## New York 245721
## California 333861
## Texas 216712
## New York 280055
## California 395224
## Texas 189300
## lt5 female
## 5to9 female
## 10to14 female
## 15to17 female
## 18to19 female
## 20 female
## 21 female
## 22to24 female
## 25to29 female
## 30to34 female
## 35to39 female
## 40to44 female
## 45to49 female
## 50to54 female
## 55to59 female
## 60to61 female
## 62to64 female
## 65to66 female
## 67to69 female
## 70to74 female
## 75to79 female
## 80to84 female
## ge85 female
## [('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',)]
## [('Wyoming',), ('Wyoming',), ('Wyoming',), ('Wyoming',), ('Wyoming',), ('Wyoming',), ('Wyoming',), ('Wyoming',), ('Wyoming',), ('Wyoming',)]
## [('Alabama', 'lt5'), ('Alabama', 'lt5'), ('Alabama', 'ge85'), ('Alabama', 'ge85'), ('Alabama', '80to84'), ('Alabama', '80to84'), ('Alabama', '75to79'), ('Alabama', '75to79'), ('Alabama', '70to74'), ('Alabama', '70to74'), ('Alabama', '67to69'), ('Alabama', '67to69'), ('Alabama', '65to66'), ('Alabama', '65to66'), ('Alabama', '62to64'), ('Alabama', '62to64'), ('Alabama', '60to61'), ('Alabama', '60to61'), ('Alabama', '5to9'), ('Alabama', '5to9')]
## 52
## [('Alabama', 46), ('Alaska', 46), ('Arizona', 46), ('Arkansas', 46), ('California', 46), ('Colorado', 46), ('Connecticut', 46), ('Delaware', 46), ('District of Columbia', 46), ('Florida', 46), ('Georgia', 46), ('Hawaii', 46), ('Idaho', 46), ('Illinois', 46), ('Indiana', 46), ('Iowa', 46), ('Kansas', 46), ('Kentucky', 46), ('Louisiana', 46), ('Maine', 46), ('Maryland', 46), ('Massachusetts', 46), ('Michigan', 46), ('Minnesota', 46), ('Mississippi', 46), ('Missouri', 46), ('Montana', 46), ('Nebraska', 46), ('Nevada', 46), ('New Hampshire', 46), ('New Jersey', 46), ('New Mexico', 46), ('New York', 46), ('North Carolina', 46), ('North Dakota', 46), ('Ohio', 46), ('Oklahoma', 46), ('Oregon', 46), ('Pennsylvania', 46), ('Puerto Rico', 46), ('Rhode Island', 46), ('South Carolina', 46), ('South Dakota', 46), ('Tennessee', 46), ('Texas', 46), ('Utah', 46), ('Vermont', 46), ('Virginia', 46), ('Washington', 46), ('West Virginia', 46), ('Wisconsin', 46), ('Wyoming', 46)]
## ['state', 'count_1']
## [('Alabama', 4779736), ('Alaska', 710231), ('Arizona', 6392017), ('Arkansas', 2915918), ('California', 37253956), ('Colorado', 5029196), ('Connecticut', 3574097), ('Delaware', 897934), ('District of Columbia', 601723), ('Florida', 18801310), ('Georgia', 9687653), ('Hawaii', 1360301), ('Idaho', 1567582), ('Illinois', 12830632), ('Indiana', 6483802), ('Iowa', 3046355), ('Kansas', 2853118), ('Kentucky', 4339367), ('Louisiana', 4533372), ('Maine', 1328361), ('Maryland', 5773552), ('Massachusetts', 6547629), ('Michigan', 9883640), ('Minnesota', 5303925), ('Mississippi', 2967297), ('Missouri', 5988927), ('Montana', 989415), ('Nebraska', 1826341), ('Nevada', 2700551), ('New Hampshire', 1316470), ('New Jersey', 8791894), ('New Mexico', 2059179), ('New York', 19378102), ('North Carolina', 9535483), ('North Dakota', 672591), ('Ohio', 11536504), ('Oklahoma', 3751351), ('Oregon', 3831074), ('Pennsylvania', 12702379), ('Puerto Rico', 3725789), ('Rhode Island', 1052567), ('South Carolina', 4625364), ('South Dakota', 814180), ('Tennessee', 6346105), ('Texas', 25145561), ('Utah', 2763885), ('Vermont', 625741), ('Virginia', 8001024), ('Washington', 6724540), ('West Virginia', 1852994), ('Wisconsin', 5686986), ('Wyoming', 563626)]
## ['state', 'population']
##                    state  population
## 0                Alabama     4779736
## 1                 Alaska      710231
## 2                Arizona     6392017
## 3               Arkansas     2915918
## 4             California    37253956
## 5               Colorado     5029196
## 6            Connecticut     3574097
## 7               Delaware      897934
## 8   District of Columbia      601723
## 9                Florida    18801310
## 10               Georgia     9687653
## 11                Hawaii     1360301
## 12                 Idaho     1567582
## 13              Illinois    12830632
## 14               Indiana     6483802
## 15                  Iowa     3046355
## 16                Kansas     2853118
## 17              Kentucky     4339367
## 18             Louisiana     4533372
## 19                 Maine     1328361
## 20              Maryland     5773552
## 21         Massachusetts     6547629
## 22              Michigan     9883640
## 23             Minnesota     5303925
## 24           Mississippi     2967297
## 25              Missouri     5988927
## 26               Montana      989415
## 27              Nebraska     1826341
## 28                Nevada     2700551
## 29         New Hampshire     1316470
## 30            New Jersey     8791894
## 31            New Mexico     2059179
## 32              New York    19378102
## 33        North Carolina     9535483
## 34          North Dakota      672591
## 35                  Ohio    11536504
## 36              Oklahoma     3751351
## 37                Oregon     3831074
## 38          Pennsylvania    12702379
## 39           Puerto Rico     3725789
## 40          Rhode Island     1052567
## 41        South Carolina     4625364
## 42          South Dakota      814180
## 43             Tennessee     6346105
## 44                 Texas    25145561
## 45                  Utah     2763885
## 46               Vermont      625741
## 47              Virginia     8001024
## 48            Washington     6724540
## 49         West Virginia     1852994
## 50             Wisconsin     5686986
## 51               Wyoming      563626

Population (2010) by State:


Chapter 3 - Advanced SQL Alchemy Queries

Calculating Values in a Query - addition, subtraction, multiplication, and the like:

  • Can put calculations directly in the select statement, such as select([(census.columns.pop2008 - census.columns.pop2000).label(“pop_change”)])
  • Can limit the number of records pulled using .limit(5) # will return 5 in this case; can use any number
  • Case statements can help with treating data differently based on a condition (a final else clause, written as else_, supplies the value when no condition matches)
    • from sqlalchemy import case
    • func.sum( case( [ (census.columns.state == “New York”, census.columns.pop2008) ], else_=0 ))
  • Cast statements can be useful for converting among integers, floats, strings, and the like
    • from sqlalchemy import case, cast, Float
    • cast(func.sum(census.columns.pop2008), Float) # will convert the sum of the population columns to a float

SQL Relationships - bridging data that appears in multiple SQL tables:

  • Sometimes, a relationship between the tables is pre-defined in the database; if so, the course indicates that a simple select statement from multiple tables will perform the join (not verified in these notes)
  • Can instead use the join clause to perform the join if it has not been pre-defined - it should come directly after select()
    • This is implemented in SQL Alchemy using the select_from() function
    • stmt = select([func.sum(census.columns.pop2000)])
    • stmt = stmt.select_from(census.join(state_fact)) # optionally, stmt = stmt.select_from(census.join(state_fact, census.columns.state == state_fact.columns.name))
    • stmt = stmt.where(state_fact.columns.circuit_court == “10”)
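
The three statements above can be run end to end with a sketch such as the following. Both tables are hypothetical in-memory stand-ins with made-up populations, and the positional select() form assumes SQLAlchemy 1.4 or later.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, func, insert, select)

# Hypothetical stand-ins for the census and state_fact tables
engine = create_engine("sqlite://")
metadata = MetaData()
census = Table(
    "census", metadata,
    Column("state", String(30)),
    Column("pop2000", Integer()),
)
state_fact = Table(
    "state_fact", metadata,
    Column("name", String(30)),
    Column("circuit_court", String(2)),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(census), [
        {"state": "Wyoming", "pop2000": 10},
        {"state": "Wyoming", "pop2000": 12},
        {"state": "Colorado", "pop2000": 30},
        {"state": "Texas", "pop2000": 50},
    ])
    conn.execute(insert(state_fact), [
        {"name": "Wyoming", "circuit_court": "10"},
        {"name": "Colorado", "circuit_court": "10"},
        {"name": "Texas", "circuit_court": "5"},
    ])

# SELECT sum(pop2000) FROM census JOIN state_fact ON census.state = state_fact.name
# WHERE state_fact.circuit_court = '10'
stmt = select(func.sum(census.columns.pop2000))
stmt = stmt.select_from(
    census.join(state_fact, census.columns.state == state_fact.columns.name)
)
stmt = stmt.where(state_fact.columns.circuit_court == "10")

with engine.connect() as conn:
    circuit_10_total = conn.execute(stmt).scalar()

print(circuit_10_total)
```

The ON condition is spelled out here because these sample tables define no ForeignKey; census.join(state_fact) with no second argument only works when a foreign key links the two tables.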

Working with Hierarchical Tables (self-referential tables) - tables that refer to themselves:

  • The alias() method allows for referring to a table with two different names, making it possible to join columns from the same table to each other
    • managers = employees.alias() # managers will now refer to the employees table
    • managers.columns.name.label(“manager”)
    • employees.columns.name.label(“employee”)
    • stmt = stmt.select_from(employees.join( managers, managers.columns.id == employees.columns.manager ))
    • stmt = stmt.order_by(managers.columns.name)
  • The alias and the table name should both be used in the query, otherwise there was no reason to create the alias
    • Be careful with group_by() and the like - check whether each column is taken from the alias or from the original table
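
Since no employees table is available for these notes, the sketch below builds a tiny hypothetical one in memory (the names and the mgr column are made up) to show the alias pattern. It assumes SQLAlchemy 1.4 or later for the positional select() form.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, func, insert, select)

# Hypothetical self-referential table: mgr holds the id of the employee's manager
engine = create_engine("sqlite://")
metadata = MetaData()
employees = Table(
    "employees", metadata,
    Column("id", Integer(), primary_key=True),
    Column("name", String(50)),
    Column("mgr", Integer()),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(employees), [
        {"id": 1, "name": "Ada", "mgr": None},
        {"id": 2, "name": "Ben", "mgr": 1},
        {"id": 3, "name": "Cy", "mgr": 1},
        {"id": 4, "name": "Dee", "mgr": 2},
    ])

# A second name for the same table, so it can be joined to itself
managers = employees.alias()
self_join = employees.join(managers, managers.columns.id == employees.columns.mgr)

# Manager / employee pairs
pair_stmt = (
    select(managers.columns.name.label("manager"),
           employees.columns.name.label("employee"))
    .select_from(self_join)
    .order_by(managers.columns.name, employees.columns.name)
)

# Head count per manager: group on the alias, count on the original table
count_stmt = (
    select(managers.columns.name, func.count(employees.columns.id))
    .select_from(self_join)
    .group_by(managers.columns.name)
    .order_by(managers.columns.name)
)

with engine.connect() as conn:
    pairs = [tuple(row) for row in conn.execute(pair_stmt)]
    counts = [tuple(row) for row in conn.execute(count_stmt)]

print(pairs, counts)
```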

Dealing with large ResultSets - running out of memory or disk space or the like:

  • The fetchmany() method allows for retrieving only a subset of the records from SQL, with the option to retrieve more records later
    • Returns an empty list when there is nothing left to retrieve
    • Need to close the ResultProxy afterwards
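
A batch-reading loop along these lines might look like the sketch below. The table is a made-up in-memory stand-in and the batch size of 2 is arbitrary; the positional select() form assumes SQLAlchemy 1.4 or later.

```python
from sqlalchemy import (Column, MetaData, String, Table,
                        create_engine, insert, select)

# Hypothetical in-memory table with a handful of made-up rows
engine = create_engine("sqlite://")
metadata = MetaData()
census = Table("census", metadata, Column("state", String(30)))
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(census), [
        {"state": "New York"},
        {"state": "New York"},
        {"state": "Texas"},
        {"state": "New York"},
        {"state": "Texas"},
    ])

state_count = {}
with engine.connect() as conn:
    results_proxy = conn.execute(select(census.columns.state))
    while True:
        # Pull at most 2 rows per round trip instead of the full result set
        partial_results = results_proxy.fetchmany(2)
        if not partial_results:
            # An empty list signals that nothing is left to retrieve
            break
        for row in partial_results:
            state_count[row.state] = state_count.get(row.state, 0) + 1
    # Release the cursor once the loop is done
    results_proxy.close()

print(state_count)
```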

Example code includes:


myPath = "./PythonInputFiles/"



import pandas as pd


# Import sqlalchemy functions
from sqlalchemy import create_engine, MetaData, Table, select, func, desc

# Create an engine to the census database
# engine = create_engine('mysql+pymysql://' + 'student:datacamp' + '@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:3306/' + 'census')
# Created dummy data with real state-gender-age-pop2010 and totally fake pop2000 = (0.90, 1.05) * pop2010
engine = create_engine("sqlite:///" + myPath + "PartialFakeCensusExample.db")

# Print the table names
print(engine.table_names())

# General pre-amble to be able to access "census"
metadata = MetaData()
census = Table("census", metadata, autoload=True, autoload_with=engine)  # make sure this is set up
state_fact = Table("state_fact", metadata, autoload=True, autoload_with=engine)  # make sure this is set up

# Build query to return state names by population difference from 2008 (make 2010) to 2000: stmt
stmt = select([census.columns.state, (census.columns.pop2010 - census.columns.pop2000).label("pop_change")])

# Append group by for the state: stmt
stmt = stmt.group_by(census.columns.state)

# Append order by for pop_change descendingly: stmt
stmt = stmt.order_by(desc("pop_change"))

# Return only 5 results: stmt
stmt = stmt.limit(5)

# Use connection to execute the statement and fetch all results
connection = engine.connect()  # Create a connection to the engine defined above
results = connection.execute(stmt).fetchall()

# Print the state and population change for each record
for result in results:
    print('{}:{}'.format(result.state, result.pop_change))


# import case, cast and Float from sqlalchemy
from sqlalchemy import case, cast, Float

# Build an expression to calculate female population in 2000
female_pop2000 = func.sum(
    case([
        (census.columns.gender == "female", census.columns.pop2000)
    ], else_=0))

# Cast an expression to calculate total population in 2000 to Float
total_pop2000 = cast(func.sum(census.columns.pop2000), Float)

# Build a query to calculate the percentage of females in 2000: stmt
stmt = select([female_pop2000 / total_pop2000* 100])

# Execute the query and store the scalar result: percent_female
percent_female = connection.execute(stmt).scalar()

# Print the percentage
print(percent_female)


# Build a statement to join census and state_fact tables: stmt
stmt = select([census.columns.pop2000, state_fact.columns.abbreviation])

# Execute the statement and get the first result: result
result = connection.execute(stmt).first()

# Loop over the keys in the result object and print the key and value
for key in result.keys():
    print(key, getattr(result, key))


# Build a statement to select the census and state_fact tables: stmt
stmt = select([census, state_fact])

# Add a select_from clause that wraps a join for the census and state_fact
# tables where the census state column and state_fact name column match
stmt = stmt.select_from(
    (census.join(state_fact, census.columns.state == state_fact.columns.name)))

# Execute the statement and get the first result: result
result = connection.execute(stmt).first()

# Loop over the keys in the result object and print the key and value
for key in result.keys():
    print(key, getattr(result, key))


# Build a statement to select the state, sum of 2008 (using 2010 instead) population and census
# division name: stmt
stmt = select([
    census.columns.state,
    func.sum(census.columns.pop2010),
    state_fact.columns.census_division_name
])

# Append select_from to join the census and state_fact tables by the census state and state_fact name columns
stmt = stmt.select_from(
    census.join(state_fact, census.columns.state == state_fact.columns.name)
)

# Append a group by for the state_fact name column
stmt = stmt.group_by(state_fact.columns.name)

# Execute the statement and get the results: results
results = connection.execute(stmt).fetchall()

# Loop over the results object and print each record.
for record in results:
    print(record)


# Make an alias of the employees table: managers
# managers = employees.alias()

# Build a query to select managers' and their employees' names: stmt
# stmt = select(
#     [managers.columns.name.label('manager'),
#      employees.columns.name.label("employee")]
# )

# Match managers id with employees mgr: stmt
# stmt = stmt.where(managers.columns.id == employees.columns.mgr)

# Order the statement by the managers name: stmt
# stmt = stmt.order_by(managers.columns.name)

# Execute statement: results
# results = connection.execute(stmt).fetchall()

# Print records
# for record in results:
#     print(record)


# Make an alias of the employees table: managers
# managers = employees.alias()

# Build a query to select managers and counts of their employees: stmt
# stmt = select([managers.columns.name, func.count(employees.columns.id)])

# Append a where clause that ensures the manager id and employee mgr are equal
# stmt = stmt.where(managers.columns.id == employees.columns.mgr)

# Group by Managers Name
# stmt = stmt.group_by(managers.columns.name)

# Execute statement: results
# results = connection.execute(stmt).fetchall()

# print manager
# for record in results:
#     print(record)


# Start a while loop checking for more results
# while more_results:
    # Fetch the first 50 results from the ResultProxy: partial_results
#     partial_results = results_proxy.fetchmany(50)

    # if empty list, set more_results to False
#     if partial_results == []:
#         more_results = False

    # Loop over the fetched records and increment the count for the state
#     for row in partial_results:
#         if row.state in state_count:
#             state_count[row.state] += 1
#         else:
#             state_count[row.state] = 1

# Close the ResultProxy, and thus the connection
# results_proxy.close()

# Print the count by state
# print(state_count)
## ['census', 'state_fact']
## Florida:22065
## Illinois:15716
## Texas:14908
## Indiana:6848
## Massachusetts:6111
## 50.85769837165718
## pop2000 21543
## abbreviation AL
## id 1
## state Wyoming
## gender male
## age lt5
## pop2000 21543
## pop2010 20596
## name Wyoming
## abbreviation WY
## census_division_name 8 (West / Mountain)
## ('Alabama', 4779736, '6 (South / East South Central)')
## ('Alaska', 710231, '9 (West / Pacific)')
## ('Arizona', 6392017, '8 (West / Mountain)')
## ('Arkansas', 2915918, '7 (South / West South Central)')
## ('California', 37253956, '9 (West / Pacific)')
## ('Colorado', 5029196, '8 (West / Mountain)')
## ('Connecticut', 3574097, '1 (Northeast / New England)')
## ('Delaware', 897934, '5 (South / South Atlantic)')
## ('District of Columbia', 601723, '5 (South / South Atlantic)')
## ('Florida', 18801310, '5 (South / South Atlantic)')
## ('Georgia', 9687653, '5 (South / South Atlantic)')
## ('Hawaii', 1360301, '9 (West / Pacific)')
## ('Idaho', 1567582, '8 (West / Mountain)')
## ('Illinois', 12830632, '3 (Midwest / East North Central)')
## ('Indiana', 6483802, '3 (Midwest / East North Central)')
## ('Iowa', 3046355, '4 (Midwest / West North Central)')
## ('Kansas', 2853118, '4 (Midwest / West North Central)')
## ('Kentucky', 4339367, '6 (South / East South Central)')
## ('Louisiana', 4533372, '7 (South / West South Central)')
## ('Maine', 1328361, '1 (Northeast / New England)')
## ('Maryland', 5773552, '5 (South / South Atlantic)')
## ('Massachusetts', 6547629, '1 (Northeast / New England)')
## ('Michigan', 9883640, '3 (Midwest / East North Central)')
## ('Minnesota', 5303925, '4 (Midwest / West North Central)')
## ('Mississippi', 2967297, '6 (South / East South Central)')
## ('Missouri', 5988927, '4 (Midwest / West North Central)')
## ('Montana', 989415, '8 (West / Mountain)')
## ('Nebraska', 1826341, '4 (Midwest / West North Central)')
## ('Nevada', 2700551, '8 (West / Mountain)')
## ('New Hampshire', 1316470, '1 (Northeast / New England)')
## ('New Jersey', 8791894, '2 (Northeast / Mid-Atlantic)')
## ('New Mexico', 2059179, '8 (West / Mountain)')
## ('New York', 19378102, '2 (Northeast / Mid-Atlantic)')
## ('North Carolina', 9535483, '5 (South / South Atlantic)')
## ('North Dakota', 672591, '4 (Midwest / West North Central)')
## ('Ohio', 11536504, '3 (Midwest / East North Central)')
## ('Oklahoma', 3751351, '7 (South / West South Central)')
## ('Oregon', 3831074, '9 (West / Pacific)')
## ('Pennsylvania', 12702379, '2 (Northeast / Mid-Atlantic)')
## ('Puerto Rico', 3725789, '0 (None)')
## ('Rhode Island', 1052567, '1 (Northeast / New England)')
## ('South Carolina', 4625364, '5 (South / South Atlantic)')
## ('South Dakota', 814180, '4 (Midwest / West North Central)')
## ('Tennessee', 6346105, '6 (South / East South Central)')
## ('Texas', 25145561, '7 (South / West South Central)')
## ('Utah', 2763885, '8 (West / Mountain)')
## ('Vermont', 625741, '1 (Northeast / New England)')
## ('Virginia', 8001024, '5 (South / South Atlantic)')
## ('Washington', 6724540, '9 (West / Pacific)')
## ('West Virginia', 1852994, '5 (South / South Atlantic)')
## ('Wisconsin', 5686986, '3 (Midwest / East North Central)')
## ('Wyoming', 563626, '8 (West / Mountain)')

Chapter 4 - Creating and Manipulating Databases

Creating Databases and Tables - different by database types, and outside the scope of this course:

  • Inside SQLite, the create_engine() call will create the database and/or file if they do not already exist
    • from sqlalchemy import (Table, Column, String, Integer, Float, Boolean) # sqlalchemy exports Float, Numeric, and DECIMAL, but no type named Decimal
    • employees = Table(“employees”, metadata, Column(“id”, Integer()), Column(“name”, String(255)))
    • metadata.create_all(engine)
    • engine.table_names() # verify that table “employees” has been created
  • Can set column options such as unique, nullable, etc.; the default is used for any option that is not specified
    • These are each settings inside the Column() calls, such as unique=True, nullable=False, default=100.00, etc.
    • Can check these with myTable.constraints # an attribute holding a set of constraints, not a method call
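
The table-creation steps and column options above can be sketched as follows. The employees table and its salary and active columns are illustrative only, and the database is in-memory SQLite. The sketch lists tables with inspect(engine).get_table_names() because engine.table_names(), used elsewhere in these notes, was deprecated in SQLAlchemy 1.4 and removed in 2.0.

```python
from sqlalchemy import (Boolean, Column, Float, Integer, MetaData, String,
                        Table, create_engine, inspect)

engine = create_engine("sqlite://")   # in-memory database, nothing written to disk
metadata = MetaData()

# Hypothetical employees table showing the unique, nullable, and default options
employees = Table(
    "employees", metadata,
    Column("id", Integer(), primary_key=True),
    Column("name", String(255), unique=True, nullable=False),
    Column("salary", Float(), default=100.00),
    Column("active", Boolean(), default=True),
)

# Emit CREATE TABLE for every table registered on the metadata
metadata.create_all(engine)

# Verify that the table now exists in the database
table_names = inspect(engine).get_table_names()
print(table_names)

# constraints is an attribute (a set), not a method
print(employees.constraints)
```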

Inserting Data into a Table - done with the insert() command:

  • from sqlalchemy import insert
  • stmt = insert(employees).values(id=1, name=“Jason”)
  • Alternately, can insert multiple values using a list of dictionaries
    • stmt = insert(employees)
    • values_list = [ {“id”:2, “name”:“Rebecca”} , {“id”:3, “name”:“Bob”} ]
    • result_proxy = connection.execute(stmt, values_list)
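
Both insert styles can be exercised with a short sketch like this one, which uses an in-memory SQLite table so nothing persists. Running inside engine.begin() commits when the block ends, and the positional select() used to read the rows back assumes SQLAlchemy 1.4 or later.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, insert, select)

engine = create_engine("sqlite://")
metadata = MetaData()
employees = Table(
    "employees", metadata,
    Column("id", Integer()),
    Column("name", String(255)),
)
metadata.create_all(engine)

with engine.begin() as conn:
    # Single row: the values go on the statement itself
    stmt = insert(employees).values(id=1, name="Jason")
    single_result = conn.execute(stmt)
    single_count = single_result.rowcount

    # Several rows: a bare insert plus a list of dictionaries
    values_list = [
        {"id": 2, "name": "Rebecca"},
        {"id": 3, "name": "Bob"},
    ]
    conn.execute(insert(employees), values_list)

# Read the rows back to confirm that all three were stored
with engine.connect() as conn:
    stmt = select(employees.columns.name).order_by(employees.columns.id)
    names = [row.name for row in conn.execute(stmt)]

print(single_count, names)
```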

Updating Data in a Database - done with the update() statement, like an insert() statement but with a where clause:

  • from sqlalchemy import update
  • stmt = update(employees)
  • stmt = stmt.where(employees.columns.id == 3)
  • stmt = stmt.values(active=True)
  • result_proxy = connection.execute(stmt)
  • Correlated Update - using a select statement to find a key value that is then used to update other portions of the table
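
The sketch below shows a plain update followed by one possible shape of a correlated update. The departments table and the dept_id / dept_title columns are hypothetical, invented only for illustration, and scalar_subquery() assumes SQLAlchemy 1.4 or later.

```python
from sqlalchemy import (Boolean, Column, Integer, MetaData, String, Table,
                        create_engine, insert, select, update)

engine = create_engine("sqlite://")
metadata = MetaData()

# Hypothetical lookup table and employees table (made-up rows)
departments = Table(
    "departments", metadata,
    Column("id", Integer(), primary_key=True),
    Column("title", String(50)),
)
employees = Table(
    "employees", metadata,
    Column("id", Integer(), primary_key=True),
    Column("name", String(50)),
    Column("dept_id", Integer()),
    Column("dept_title", String(50)),
    Column("active", Boolean(), default=False),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(departments), [
        {"id": 10, "title": "Sales"},
        {"id": 20, "title": "Research"},
    ])
    conn.execute(insert(employees), [
        {"id": 1, "name": "Jason", "dept_id": 10},
        {"id": 2, "name": "Rebecca", "dept_id": 20},
        {"id": 3, "name": "Bob", "dept_id": 10},
    ])

    # Plain update: the where clause limits the change to one row
    stmt = update(employees).where(employees.columns.id == 3).values(active=True)
    updated = conn.execute(stmt).rowcount

    # Correlated update: for each employee row, look up the matching
    # department title and write it into dept_title
    title_lookup = (
        select(departments.columns.title)
        .where(departments.columns.id == employees.columns.dept_id)
        .scalar_subquery()
    )
    conn.execute(update(employees).values(dept_title=title_lookup))

with engine.connect() as conn:
    stmt = select(
        employees.columns.name,
        employees.columns.dept_title,
        employees.columns.active,
    ).order_by(employees.columns.id)
    rows = [tuple(row) for row in conn.execute(stmt)]

print(updated, rows)
```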

Removing Data from a Database - done with the delete() statement - BE CAREFUL!:

  • from sqlalchemy import delete
  • stmt = select([func.count(extra_employees.columns.id)])
  • connection.execute(stmt).scalar()
  • delete_stmt = delete(extra_employees)
  • result_proxy = connection.execute(delete_stmt)
  • Can instead use where clauses, such as
    • stmt = delete(employees).where(employees.columns.id == 3)
  • Dropping a table completely involves using the “drop” method on the table - metadata will still be in Python until the next re-start, though
    • extra_employees.drop(engine)
    • extra_employees.exists(engine) # will now be False
  • Dropping all tables using the metadata - use the drop_all() command
    • metadata.drop_all(engine)
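
A cautious way to try the delete and drop steps is against a throwaway in-memory table, as in this sketch. The rows are made up, and inspect(engine).has_table() is used as the existence check on the assumption of SQLAlchemy 1.4 or later.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, delete, func, insert, inspect, select)

engine = create_engine("sqlite://")
metadata = MetaData()
extra_employees = Table(
    "extra_employees", metadata,
    Column("id", Integer()),
    Column("name", String(255)),
)
metadata.create_all(engine)

count_stmt = select(func.count(extra_employees.columns.id))

with engine.begin() as conn:
    conn.execute(insert(extra_employees), [
        {"id": 1, "name": "Jason"},
        {"id": 2, "name": "Rebecca"},
        {"id": 3, "name": "Bob"},
    ])
    before = conn.execute(count_stmt).scalar()

    # Targeted delete: only rows matching the where clause are removed
    conn.execute(delete(extra_employees).where(extra_employees.columns.id == 3))
    after_one = conn.execute(count_stmt).scalar()

    # No where clause: every row is removed, but the table itself remains
    conn.execute(delete(extra_employees))
    after_all = conn.execute(count_stmt).scalar()

# Drop the table from the database
extra_employees.drop(engine)
exists_in_db = inspect(engine).has_table("extra_employees")

# The Python-side metadata still remembers the table definition
still_in_metadata = "extra_employees" in metadata.tables

print(before, after_one, after_all, exists_in_db, still_in_metadata)
```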

Example code includes:


myPath = "./PythonInputFiles/"



import pandas as pd


# Import sqlalchemy functions
from sqlalchemy import create_engine, MetaData, Table, select, func, desc

# Import Table, Column, String, Integer, Float, Boolean from sqlalchemy
from sqlalchemy import Table, Column, String, Integer, Float, Boolean


# Set up for a new FAKE database
engine = create_engine("sqlite:///" + myPath + "_notuse_CreatedFake.db")
print(engine.table_names())
metadata = MetaData()


# Define a new table with a name, count, amount, and valid column: data
data = Table('data', metadata,
             Column("name", String(255)),
             Column('count', Integer()),
             Column("amount", Float()),
             Column("valid", Boolean())
)

# Use the metadata to create the table
metadata.create_all(engine)

# Print table details
print(repr(data))


# Define a new table with a name, count, amount, and valid column: data02
data02 = Table('data02', metadata,
               Column('name', String(255), unique=True),
               Column('count', Integer(), default=1),
               Column('amount', Float()),
               Column('valid', Boolean(), default=False)
)

# Use the metadata to create the table
metadata.create_all(engine)

# Print the table details
print(repr(metadata.tables['data02']))


# Import insert from sqlalchemy
from sqlalchemy import insert

# Build an insert statement to insert a record into the data02 table: stmt
stmt = insert(data02).values(name="Anna", count=1, amount=1000.00, valid=True)

# Execute the statement via the connection: results
connection = engine.connect()
results = connection.execute(stmt)

# Print result rowcount
print(results.rowcount)

# Build a select statement to validate the insert
stmt = select([data02]).where(data02.columns.name == "Anna")

# Print the result of executing the query.
print(connection.execute(stmt).first())


# Delete the row so the table is empty again
stmt = "DELETE FROM data02"  # Since there is no WHERE, this will delete everything
results = connection.execute(stmt)
print(results.rowcount)


# Build a list of dictionaries: values_list
values_list = [
    {'name': "Anna", 'count': 1, 'amount': 1000.00, 'valid': True},
    {'name': "Taylor", 'count': 1, 'amount': 750.00, 'valid': False}
]

# Build an insert statement for the data02 table: stmt
stmt = insert(data02)

# Execute stmt with the values_list: results
results = connection.execute(stmt, values_list)

# Print rowcount
print(results.rowcount)


# Place census data in the fake DB
census = Table('census', metadata,
               Column('state', String(255)),
               Column('gender', String(6)),
               Column('age', String(255)),
               Column('pop2000', Integer()),
               Column('pop2010', Integer())
)

metadata.create_all(engine)
print(repr(data))


# Create an insert statement for census: stmt
stmt = insert(census)

# Create an empty list and zeroed row count: values_list, total_rowcount
values_list = []
total_rowcount = 0



# Enumerate the lines of the CSV file (the first line holds the headers)
for idx, row in enumerate(open(myPath + "_notuse_census2000.csv", "r")):
    if idx == 0 : 
        print("Headers are: ", row)
        continue
    
    # Headers for this file are id,state,gender,age,pop2000,pop2010
    rowItems = row.split(",")
    data = {'state': rowItems[1], 'gender': rowItems[2], 'age': rowItems[3], 'pop2000': int(rowItems[4]),
            'pop2010': int(rowItems[5])}
    values_list.append(data)
    
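    # Insert each batch of 51 rows; note that rows left over after the last
    # full batch are never inserted, so the total can fall short of the file's row count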
    if idx % 51 == 0:
        results = connection.execute(stmt, values_list)
        total_rowcount += results.rowcount
        values_list = []

# Print total rowcount
print(total_rowcount)


# Place state_fact data in the fake DB
state_fact = Table('state_fact', metadata,
               Column('name', String(255)),
               Column('abbreviation', String(2)),
               Column('census_division_name', String(255)),
               Column('fips_state', Integer(), default=0),
               Column('notes', String(255), default="none")
)

metadata.create_all(engine)
print(repr(state_fact))


# Read CSV for state facts
stateFact = pd.read_csv(myPath + "_notuse_stateFact.csv")
values_list = []

for x in range(stateFact.shape[0]):
    y = stateFact.iloc[x, :]
    values_list.append( { "name":y["name"], "abbreviation":y["abbreviation"], "census_division_name":y["census_division_name"] })


# Insert the rows into the table
stmt = insert(state_fact)
results = connection.execute(stmt, values_list)


# Build a select statement: select_stmt
select_stmt = select([state_fact]).where(state_fact.columns.name == "New York")

# Print the results of executing the select_stmt
print(connection.execute(select_stmt).fetchall())

# Build a statement to update the fips_state to 36: stmt
from sqlalchemy import update
stmt = update(state_fact).values(fips_state = 36)

# Append a where clause to limit it to records for New York state
stmt = stmt.where(state_fact.columns.name == "New York")

# Execute the statement: results
results = connection.execute(stmt)

# Print rowcount
print(results.rowcount)

# Execute the select_stmt again to view the changes
print(connection.execute(select_stmt).fetchall())


# Build a statement to update the notes to 'The Wild West': stmt
stmt = update(state_fact).values(notes = "The Wild West")

# Append a where clause to match the West census region records
stmt = stmt.where(state_fact.columns.census_division_name == "8 (West / Mountain)")

# Execute the statement: results
results = connection.execute(stmt)

# Print rowcount
print(results.rowcount)


# Build a statement to select name from state_fact: stmt
# fips_stmt = select([state_fact.columns.name])

# Append a where clause to Match the fips_state to flat_census fips_code
# fips_stmt = fips_stmt.where(
#     state_fact.columns.fips_state == flat_census.columns.fips_code)

# Build an update statement to set the name to fips_stmt: update_stmt
# update_stmt = update(flat_census).values(state_name=fips_stmt)

# Execute update_stmt: results
# results = connection.execute(update_stmt)

# Print rowcount
# print(results.rowcount)


# Import delete, select
from sqlalchemy import delete, select

# Build a statement to empty the census table: stmt
stmt = delete(census)

# Execute the statement: results
results = connection.execute(stmt)

# Print affected rowcount
print(results.rowcount)

# Build a statement to select all records from the census table
stmt = select([census])

# Print the results of executing the statement to verify there are no rows
print(connection.execute(stmt).fetchall())


# Build a statement to count records using the sex column for Men ('M') age 36: stmt
# stmt = select([func.count(census.columns.sex)]).where(
#     and_(census.columns.sex == 'M',
#          census.columns.age == 36)
# )

# Execute the select statement and use the scalar() fetch method to save the record count
# to_delete = connection.execute(stmt).scalar()

# Build a statement to delete records from the census table: stmt_del
# stmt_del = delete(census)

# Append a where clause to target Men ('M') age 36
# stmt_del = stmt_del.where(
#     and_(census.columns.sex == "M",
#          census.columns.age == 36)
# )

# Execute the statement: results
# results = connection.execute(stmt_del)

# Print affected rowcount and to_delete record count, make sure they match
# print(results.rowcount, to_delete)


# Drop the state_fact table
state_fact.drop(engine)

# Check to see if state_fact exists
print(state_fact.exists(engine))

# Drop all tables
metadata.drop_all(engine)

# Check to see if census exists
print(census.exists(engine))


# Get rid of all tables in the database
metadata.drop_all(engine)
connection.close()
## []
## Table('data', MetaData(bind=None), Column('name', String(length=255), table=<data>), Column('count', Integer(), table=<data>), Column('amount', Float(), table=<data>), Column('valid', Boolean(), table=<data>), schema=None)
## Table('data02', MetaData(bind=None), Column('name', String(length=255), table=<data02>), Column('count', Integer(), table=<data02>, default=ColumnDefault(1)), Column('amount', Float(), table=<data02>), Column('valid', Boolean(), table=<data02>, default=ColumnDefault(False)), schema=None)
## 1
## ('Anna', 1, 1000.0, True)
## 1
## 2
## Table('data', MetaData(bind=None), Column('name', String(length=255), table=<data>), Column('count', Integer(), table=<data>), Column('amount', Float(), table=<data>), Column('valid', Boolean(), table=<data>), schema=None)
## Headers are:  id,state,gender,age,pop2000,pop2010
## 
## 2346
## Table('state_fact', MetaData(bind=None), Column('name', String(length=255), table=<state_fact>), Column('abbreviation', String(length=2), table=<state_fact>), Column('census_division_name', String(length=255), table=<state_fact>), Column('fips_state', Integer(), table=<state_fact>, default=ColumnDefault(0)), Column('notes', String(length=255), table=<state_fact>, default=ColumnDefault('none')), schema=None)
## [('New York', 'NY', '2 (Northeast / Mid-Atlantic)', 0, 'none')]
## 1
## [('New York', 'NY', '2 (Northeast / Mid-Atlantic)', 36, 'none')]
## 8
## 2346
## []
## False
## False

Chapter 5 - Case Study

Census Case Study - three components:

  • Prepare SQLAlchemy and the Database
  • Load data into the Database
  • Solve Data Science Problems with the Database

Populating the Database - using CSV file from the Census:

  • Define an empty list
  • Loop over the rows of the CSV
  • Make each row into a dictionary
  • Append each dictionary to the list
  • Then, add everything to the table
    • stmt = insert(census)
    • result_proxy = connection.execute(stmt, values_list)
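The steps above can be sketched as follows. This is a minimal illustration, not the course's code: it uses csv.DictReader over an in-memory string instead of a file on disk, and the standard-library sqlite3 module instead of SQLAlchemy, so that it is self-contained. The three data rows are made up.

```python
import csv
import io
import sqlite3

# Stand-in for the census CSV file; these rows are hypothetical
csv_text = (
    "id,state,gender,age,pop2000,pop2010\n"
    "1,Ohio,female,20,100,110\n"
    "2,Ohio,male,20,90,95\n"
    "3,Iowa,female,21,50,60\n"
)

# Define an empty list, loop over the rows, and append one dictionary per row
values_list = []
for row in csv.DictReader(io.StringIO(csv_text)):
    values_list.append({
        "state": row["state"],
        "gender": row["gender"],
        "age": row["age"],
        "pop2000": int(row["pop2000"]),   # CSV values arrive as strings
        "pop2010": int(row["pop2010"]),
    })

# Add everything to the table in one call
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (state TEXT, gender TEXT, age TEXT, pop2000 INTEGER, pop2010 INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (:state, :gender, :age, :pop2000, :pop2010)",
    values_list,
)
inserted = conn.execute("SELECT COUNT(*) FROM census").fetchone()[0]
print(inserted)
conn.close()
```

Passing the whole list in one call mirrors connection.execute(stmt, values_list); if the list is sent in batches instead, any partial final batch needs to be sent after the loop ends.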

Example Queries:

  • Average age by gender
  • Percentage by gender by state
  • Difference in 2010 vs 2000 populations
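As a rough sketch, the three example queries look like this in plain SQL, run here through the standard-library sqlite3 module (an assumption for self-containment; the notes use SQLAlchemy). The table holds three made-up rows chosen so the results are easy to check by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (state TEXT, gender TEXT, age REAL, pop2000 INTEGER, pop2010 INTEGER)")
conn.executemany("INSERT INTO census VALUES (?, ?, ?, ?, ?)", [
    ("Ohio", "female", 20.0, 100, 100),
    ("Ohio", "female", 40.0, 100, 300),
    ("Ohio", "male", 30.0, 100, 100),
])

# Population-weighted average age by gender
avg_age = conn.execute("""
    SELECT gender, SUM(age * pop2010) / SUM(pop2010) AS average_age
    FROM census GROUP BY gender ORDER BY gender
""").fetchall()

# Percentage female by state: CASE keeps only the female population in the numerator
pct_female = conn.execute("""
    SELECT state,
           SUM(CASE WHEN gender = 'female' THEN pop2010 ELSE 0 END)
             / CAST(SUM(pop2010) AS REAL) * 100 AS percent_female
    FROM census GROUP BY state
""").fetchall()

# Population change by state: sum both years before taking the difference
pop_change = conn.execute("""
    SELECT state, SUM(pop2010) - SUM(pop2000) AS pop_change
    FROM census GROUP BY state ORDER BY pop_change DESC
""").fetchall()

print(avg_age, pct_female, pop_change)
conn.close()
```

The CAST to REAL plays the same role as cast(..., Float) in the SQLAlchemy version: it avoids integer division when both sums are integers.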

Example code includes:


myPath = "./PythonInputFiles/"



import pandas as pd


# Import sqlalchemy functions
from sqlalchemy import create_engine, MetaData, Table, select, func, desc
from sqlalchemy import Table, Column, String, Integer, Float, Boolean


# Define an engine to connect to chapter5.sqlite: engine
engine = create_engine('sqlite:///' + myPath + 'chapter5.sqlite')

# Initialize MetaData: metadata
metadata = MetaData()


# Build a census table: census
census = Table('census', metadata,
               Column('state', String(30)),
               Column("gender", String(6)),
               Column("age", Float()),
               Column("pop2000", Integer()),
               Column("pop2010", Integer()),
               Column("ageText", String(30))
               )

# Create the table in the database
metadata.create_all(engine)

# Create mapping of text ages to numeric ages
import numpy as np
tmpAge = list(pd.read_csv(myPath + "_notuse_census2000.csv")["age"].unique())
tmpNum = [np.mean([int(x.split("to")[0]), int(x.split("to")[1])]) if x.find("to") > -1 else 0 for x in tmpAge]
tmpNum[tmpAge.index("20")] = 20
tmpNum[tmpAge.index("21")] = 21
tmpNum[tmpAge.index("lt5")] = 2.5
tmpNum[tmpAge.index("ge85")] = 90

# Create an empty list: values_list
values_list = []

# Iterate over the rows
for idx, row in enumerate(open(myPath + "_notuse_census2000.csv", "r")):
    if idx == 0 : 
        print("Headers are: ", row)
        continue
    
    # Create a dictionary with the values
    rowItems = row.split(",")
    ageNum = tmpNum[tmpAge.index(rowItems[3])]
    data = {'state': rowItems[1], 'gender': rowItems[2], 'age': ageNum, 'pop2000': int(rowItems[4]),
            'pop2010': int(rowItems[5]), 'ageText':rowItems[3]}
    values_list.append(data)

# Import insert
from sqlalchemy import insert

# Build insert statement: stmt
stmt = insert(census)

# Use values_list to insert data: results
connection = engine.connect()
results = connection.execute(stmt, values_list)

# Print rowcount
print(results.rowcount)


# Import select
from sqlalchemy import select

# Calculate weighted average age: stmt
stmt = select([census.columns.gender,
               (func.sum(census.columns.age * census.columns.pop2010) /
                func.sum(census.columns.pop2010)).label("average_age")
               ])

# Group by sex
stmt = stmt.group_by(census.columns.gender)

# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()


# Print the average age by sex
for x in results:
    print(x[0], x[1])


# import case, cast and Float from sqlalchemy
from sqlalchemy import case, cast, Float

# Build a query to calculate the percentage of females in 2010: stmt
stmt = select([census.columns.state,
    (func.sum(
        case([
            (census.columns.gender == 'female', census.columns.pop2010)
        ], else_=0)) /
     cast(func.sum(census.columns.pop2010), Float) * 100).label('percent_female')
])

# Group By state
stmt = stmt.group_by(census.columns.state)

# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()

# Plot the results by state
import matplotlib.pyplot as plt

pctFemale = [y for x, y in results]
pctState = [x for x, y in results]
myDF = pd.DataFrame( {"% female":pd.to_numeric(pctFemale)}, index=pctState )
myDF.sort_values("% female", ascending=False).plot(kind="bar", ylim=(46, 54))
plt.title("% Female by State (2010 Census)")
# plt.show()
plt.savefig("_dummyPy075.png", bbox_inches="tight")
plt.clf()



# Print the percentage
# for result in results:
#     print(result.state, result.percent_female)


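# Build query to return state name and population difference from 2010 to 2000
# Note: the difference is not wrapped in func.sum(), so after grouping each state
# shows the change for a single row of that state rather than the state total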
stmt = select([census.columns.state,
     (census.columns.pop2010 - census.columns.pop2000).label('pop_change')
])

# Group by State
stmt = stmt.group_by(census.columns.state)

# Order by Population Change
stmt = stmt.order_by(desc("pop_change"))

# Limit to top 10
stmt = stmt.limit(10)

# Use connection to execute the statement and fetch all results
results = connection.execute(stmt).fetchall()

# Print the state and population change for each record
for result in results:
    print('{}:{}'.format(result.state, result.pop_change))



# Calculate average age by state (2010)
stmt = select([census.columns.state,
               (func.sum(census.columns.age * census.columns.pop2010) /
                func.sum(census.columns.pop2010)).label("average_age")
               ])

# Group by state
stmt = stmt.group_by(census.columns.state)

# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()

myDF2 = pd.DataFrame( {"Avg. Age":pd.to_numeric([y for x, y in results])}, index=[x for x, y in results] )
myDF2.sort_values("Avg. Age", ascending=False).plot(kind="bar", ylim=(30, 45))
plt.title("Average Age by State (2010 Census)")
# plt.show()
plt.savefig("_dummyPy076.png", bbox_inches="tight")
plt.clf()


# Delete the DB
# Get rid of all tables in the database
metadata.drop_all(engine)
connection.close()
## C:\Users\Dave\AppData\Local\Programs\Python\PYTHON~1\lib\site-packages\sqlalchemy\sql\sqltypes.py:596: SAWarning: Dialect sqlite+pysqlite does *not* support Decimal objects natively, and SQLAlchemy must convert from floating point - rounding errors and other issues may occur. Please consider storing Decimal numbers as strings or integers on this platform for lossless storage.
##   'storage.' % (dialect.name, dialect.driver))
## Headers are:  id,state,gender,age,pop2000,pop2010
## 
## 2392
## female 38.46474229575023
## male 36.27156086981655
## Florida:22065
## Illinois:15716
## Texas:14908
## Indiana:6848
## Massachusetts:6111
## Virginia:5374
## Tennessee:5102
## Connecticut:4984
## Louisiana:4345
## North Carolina:3406

% Female (2010 Census) by State:

Average Age (2010 Census) by State:

Data Types for Data Science

Chapter 1 - Fundamental Data Types

Introduction and lists - “container sequences” hold other types of data:

  • Container sequences can be mutable (list, set) or immutable (tuple)
    • Can iterate over container sequences also
  • Lists hold data in the order that it was added
    • Mutable
    • Indexed
  • Adding items to an existing list
    • myList.append(newItem)
    • myList[2] # extract the third item
    • myListA + myListB # will be a single list, with the items from myListB at the end
  • Finding and removing items in a list
    • myList.index(myItem) # returns the index position of the first occurrence of myItem
    • myList.pop(myIndex) # returns the item at myIndex position, AND ALSO removes the item from the list
  • Iterating and Sorting
    • for item in myList:
    • sorted(myList) # produces a sorted version of myList
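The list operations above, in one short runnable sketch (the list contents are made up):

```python
my_list = ["flour", "sugar"]

# Adding items: append() changes the list in place, + builds a new list
my_list.append("butter")
combined = my_list + ["eggs", "milk"]

# Indexing is zero-based, so [2] is the third item
third = combined[2]

# Finding and removing: pop() returns the item and removes it from the list
position = combined.index("sugar")
removed = combined.pop(position)

# sorted() returns a new sorted list and leaves the original order untouched
in_order = sorted(combined)

for item in in_order:
    print(item)
```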

Tuples - somewhat like a list in how they hold data, but with key differences:

  • Tuples are much more memory efficient than lists, though they are also immutable
    • Immutability also has advantages of certainty - knowing that the data will not be modified
  • Zipping and Unpacking are common actions taken in the tuple space
    • zip() is a common method of creating tuples - zip(listA, listB) will create 2-tuples with (listA[0], listB[0]), (listA[1], listB[1]), etc. - technically, creates an iterator
    • Unpacking (expanding) tuples is also common and expressive - a, b = myTuple will extract item0 as a and item1 as b
  • Tuple unpacking can be especially powerful in loops
    • for a, b in myTupleList:
  • The enumerate() function creates tuples where the first item is the index and the second item is the item
    • for idx, item in enumerate(myTuples): a, b = item; print(idx, a, b)
  • Beware of trailing commas - item2 = ‘butter’, (with the trailing comma) will create a tuple (‘butter’,)
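A compact sketch of zipping, unpacking, enumerating, and the trailing-comma pitfall (all values are made up):

```python
list_a = ["Anna", "Taylor"]
list_b = [1000.0, 750.0]

# zip() returns an iterator; list() materializes it as a list of 2-tuples
pairs = list(zip(list_a, list_b))

# Unpacking assigns each position of a tuple to its own name
first_name, first_amount = pairs[0]

# enumerate() yields (index, item) tuples; the item can be unpacked in the same line
lines = []
for idx, (name, amount) in enumerate(pairs):
    lines.append("{}: {} {}".format(idx, name, amount))

# A trailing comma silently turns a plain value into a one-item tuple
item1 = "butter"
item2 = "butter",

print(pairs, lines, type(item1), type(item2))
```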

Sets for unordered and unique data - excellent for finding all the unique values:

  • Sets are for storing unique and unordered items; they are also mutable
    • mySet = set(myList)
  • Several options for modifying sets
    • .add() will add the item if it does not already exist, and ignore it if it does
    • .update() will merge in another set, again only adding the items that do not already exist
    • .discard() will “safely” remove an item from the set, which is to say that no error is thrown even if the item is not in the set
    • .pop() will remove and return an arbitrary element from the set (sets are unordered, so which element comes back should not be relied upon); will throw a KeyError if the set is empty
  • Several options for assessing similarities and differences among sets
    • setA.union(setB) returns a set of everything in either
    • setA.intersection(setB) returns a set of everything in both
    • setA.difference(setB) returns everything in setA that is not in setB
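The set methods above can be tried out with a few made-up values:

```python
# Duplicates collapse when a list is converted to a set
my_set = set(["a", "b", "a"])

my_set.add("c")            # added, since "c" is new
my_set.add("a")            # ignored, since "a" is already present
my_set.update({"c", "d"})  # only "d" is actually new
my_set.discard("z")        # no error even though "z" is absent

size_before_pop = len(my_set)
popped = my_set.pop()      # removes and returns an arbitrary element

# pop() on an empty set raises KeyError
try:
    set().pop()
    empty_pop_failed = False
except KeyError:
    empty_pop_failed = True

set_a = {1, 2, 3}
set_b = {3, 4}
in_either = set_a.union(set_b)
in_both = set_a.intersection(set_b)
only_in_a = set_a.difference(set_b)

print(size_before_pop, in_either, in_both, only_in_a)
```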

Example code includes:


myPath = "./PythonInputFiles/"



# Create a list containing the names: baby_names
baby_names = ['Ximena', 'Aliza', 'Ayden', 'Calvin']

# Extend baby_names with 'Rowen' and 'Sandeep'
baby_names.extend(['Rowen', 'Sandeep'])

# Print baby_names
print(baby_names)

# Find the position of 'Aliza': position
position = baby_names.index("Aliza")

# Remove 'Aliza' from baby_names
baby_names.pop(position)

# Print baby_names
print(baby_names)


# A list of lists, records has been pre-loaded. If you explore it in the IPython Shell, you'll see that each entry is a list of this form:  ['2011', 'FEMALE', 'HISPANIC', 'GERALDINE', '13', '75']
# Dummy up something similar from the SSA data
import pandas as pd
pd2011 = pd.read_csv(myPath + "yob2011.txt", header=None, names=["Name", "Gender", "Count"])


# Speed the processing - keep only names with Count >= 5000
records2011 = []
for idx in pd2011.loc[pd2011["Count"] >= 5000].index:
    rowData = pd2011.loc[idx]
    newList = ["2011", rowData["Gender"], "NA", rowData["Name"], "NA", rowData["Count"]]
    records2011.append(newList)



# Create the empty list: baby_names
baby_names = []

# Loop over a list of records 
for row in records2011:
    # Add the name found in column 3 to the list
    baby_names.append(row[3])

# Sort the names in alphabetical order
for name in sorted(baby_names):
    # Print each name
    print(name)


girl_names = ['GRACE', 'Victoria', 'Rachel', 'Anna', 'Samantha', 'Kayla', 'Claire', 'Ashley', 'Zoe', 'Alina', 'Angela', 'Olivia', 'AVA', 'Valentina', 'CAMILA', 'Miriam', 'MADISON', 'Aaliyah', 'RACHEL', 'Serenity', 'EMILY', 'Mia', 'Chloe', 'MIA', 'LONDON', 'Chana', 'TAYLOR', 'CHLOE', 'FIONA', 'Camila', 'GABRIELLE', 'SOPHIA', 'CHANA', 'LEAH', 'ELLA', 'GENESIS', 'Madison', 'Emily', 'NEVAEH', 'ASHLEY', 'Isabella', 'ISABELLA', 'Sophia', 'OLIVIA', 'Leah', 'Esther', 'Mariam', 'JADA', 'London', 'TIFFANY', 'SERENITY', 'Emma', 'Savannah', 'CHAYA', 'KAYLA', 'SOFIA', 'ABIGAIL', 'Grace', 'Chaya', 'Taylor', 'ANGELA', 'Sarah', 'Brielle', 'MAKAYLA', 'EMMA', 'ESTHER', 'Ava', 'AALIYAH', 'HAILEY', 'MIRIAM', 'Skylar', 'SARAH', 'Fatoumata', 'Sofia']
boy_names = ['ANGEL', 'Jacob', 'Josiah', 'Daniel', 'CHRISTIAN', 'William', 'MASON', 'Eric', 'JUSTIN', 'LUCAS', 'Mason', 'TYLER', 'Elijah', 'Noah', 'ISAIAH', 'JEREMIAH', 'JOSHUA', 'JAYDEN', 'Samuel', 'KEVIN', 'AIDEN', 'James', 'Aiden', 'Alexander', 'ELIJAH', 'Benjamin', 'Jeremiah', 'Liam', 'Carter', 'ANTHONY', 'Ryan', 'DAVID', 'DANIEL', 'Joshua', 'JAMES', 'Joseph', 'JACOB', 'RYAN', 'Dylan', 'Ethan', 'JACK', 'NOAH', 'David', 'SAMUEL', 'Lucas', 'Matthew', 'Jack', 'Jason', 'ALEXANDER', 'MATTHEW', 'Michael', 'Jayden', 'MOSHE', 'ETHAN', 'JOSEPH', 'MUHAMMAD', 'SEBASTIAN', 'BENJAMIN', 'Moshe', 'Amir', 'Sebastian', 'MICHAEL', 'CHRISTOPHER', 'Angel', 'JOSIAH', 'ERIC', 'JASON', 'Muhammad']

# Pair up the boy and girl names: pairs
pairs = zip(girl_names, boy_names)

# Iterate over pairs
for idx, pair in enumerate(pairs):
    # Unpack pair: girl_name, boy_name
    girl_name, boy_name = pair
    # Print the rank and names associated with each rank
    print('Rank {}: {} and {}'.format(idx, girl_name, boy_name))


# Create the normal variable: normal
normal = "simple"

# Create the mistaken variable: error
error = 'trailing comma',

# Print the types of the variables
print(type(normal))
print(type(error))


# Same SSA process for 2014 baby names
pd2014 = pd.read_csv(myPath + "yob2014.txt", header=None, names=["Name", "Gender", "Count"])

# Speed the processing - keep only names with Count >= 5000
records2014 = []
for idx in pd2014.loc[pd2014["Count"] >= 5000].index:
    rowData = pd2014.loc[idx]
    newList = ["2014", rowData["Gender"], "NA", rowData["Name"], "NA", rowData["Count"]]
    records2014.append(newList)


# Convert them to sets (only names with 5,000+)
baby_names_2011 = set(pd2011.loc[pd2011["Count"] >= 5000]["Name"])
baby_names_2014 = set(pd2014.loc[pd2014["Count"] >= 5000]["Name"])


# Find the union: all_names
all_names = baby_names_2011.union(baby_names_2014)

# Print the count of names in all_names
print(len(all_names))

# Find the intersection: overlapping_names
overlapping_names = baby_names_2011.intersection(baby_names_2014)

# Print the count of names in overlapping_names
print(len(overlapping_names))


# Create the empty set: baby_names_2011
baby_names_2011 = set()

# Loop over records and add the names from 2011 to the baby_names_2011 set
for row in records2011:
    # Check if the first column is '2011'
    if row[0] == '2011':
        # Add the fourth column to the set
        baby_names_2011.add(row[3])

# Find the difference between 2011 and 2014: differences
differences = baby_names_2011.difference(baby_names_2014)

# Print the differences
print(differences)
## ['Ximena', 'Aliza', 'Ayden', 'Calvin', 'Rowen', 'Sandeep']
## ['Ximena', 'Ayden', 'Calvin', 'Rowen', 'Sandeep']
## Aaliyah
## Aaron
## Abigail
## Adam
## Addison
## Adrian
## Aiden
## Alexander
## Alexis
## Allison
## Alyssa
## Amelia
## Andrew
## Angel
## Anna
## Anthony
## Ashley
## Aubrey
## Audrey
## Austin
## Ava
## Avery
## Ayden
## Benjamin
## Bentley
## Blake
## Brandon
## Brayden
## Brianna
## Brody
## Brooklyn
## Caleb
## Cameron
## Carter
## Charles
## Charlotte
## Chase
## Chloe
## Christian
## Christopher
## Colton
## Connor
## Cooper
## Daniel
## David
## Dominic
## Dylan
## Eli
## Elijah
## Elizabeth
## Ella
## Emily
## Emma
## Ethan
## Evan
## Evelyn
## Gabriel
## Gabriella
## Gavin
## Grace
## Hailey
## Hannah
## Henry
## Hunter
## Ian
## Isaac
## Isabella
## Isaiah
## Jack
## Jackson
## Jacob
## James
## Jason
## Jayden
## Jeremiah
## John
## Jonathan
## Jordan
## Jose
## Joseph
## Joshua
## Josiah
## Julian
## Justin
## Kaylee
## Kevin
## Landon
## Layla
## Leah
## Levi
## Liam
## Lillian
## Lily
## Logan
## Lucas
## Luke
## Madison
## Mason
## Matthew
## Mia
## Michael
## Natalie
## Nathan
## Nevaeh
## Nicholas
## Noah
## Oliver
## Olivia
## Owen
## Parker
## Riley
## Robert
## Ryan
## Samantha
## Samuel
## Sarah
## Savannah
## Sebastian
## Sofia
## Sophia
## Taylor
## Thomas
## Tyler
## Victoria
## William
## Wyatt
## Xavier
## Zachary
## Zoe
## Zoey
## Rank 0: GRACE and ANGEL
## Rank 1: Victoria and Jacob
## Rank 2: Rachel and Josiah
## Rank 3: Anna and Daniel
## Rank 4: Samantha and CHRISTIAN
## Rank 5: Kayla and William
## Rank 6: Claire and MASON
## Rank 7: Ashley and Eric
## Rank 8: Zoe and JUSTIN
## Rank 9: Alina and LUCAS
## Rank 10: Angela and Mason
## Rank 11: Olivia and TYLER
## Rank 12: AVA and Elijah
## Rank 13: Valentina and Noah
## Rank 14: CAMILA and ISAIAH
## Rank 15: Miriam and JEREMIAH
## Rank 16: MADISON and JOSHUA
## Rank 17: Aaliyah and JAYDEN
## Rank 18: RACHEL and Samuel
## Rank 19: Serenity and KEVIN
## Rank 20: EMILY and AIDEN
## Rank 21: Mia and James
## Rank 22: Chloe and Aiden
## Rank 23: MIA and Alexander
## Rank 24: LONDON and ELIJAH
## Rank 25: Chana and Benjamin
## Rank 26: TAYLOR and Jeremiah
## Rank 27: CHLOE and Liam
## Rank 28: FIONA and Carter
## Rank 29: Camila and ANTHONY
## Rank 30: GABRIELLE and Ryan
## Rank 31: SOPHIA and DAVID
## Rank 32: CHANA and DANIEL
## Rank 33: LEAH and Joshua
## Rank 34: ELLA and JAMES
## Rank 35: GENESIS and Joseph
## Rank 36: Madison and JACOB
## Rank 37: Emily and RYAN
## Rank 38: NEVAEH and Dylan
## Rank 39: ASHLEY and Ethan
## Rank 40: Isabella and JACK
## Rank 41: ISABELLA and NOAH
## Rank 42: Sophia and David
## Rank 43: OLIVIA and SAMUEL
## Rank 44: Leah and Lucas
## Rank 45: Esther and Matthew
## Rank 46: Mariam and Jack
## Rank 47: JADA and Jason
## Rank 48: London and ALEXANDER
## Rank 49: TIFFANY and MATTHEW
## Rank 50: SERENITY and Michael
## Rank 51: Emma and Jayden
## Rank 52: Savannah and MOSHE
## Rank 53: CHAYA and ETHAN
## Rank 54: KAYLA and JOSEPH
## Rank 55: SOFIA and MUHAMMAD
## Rank 56: ABIGAIL and SEBASTIAN
## Rank 57: Grace and BENJAMIN
## Rank 58: Chaya and Moshe
## Rank 59: Taylor and Amir
## Rank 60: ANGELA and Sebastian
## Rank 61: Sarah and MICHAEL
## Rank 62: Brielle and CHRISTOPHER
## Rank 63: MAKAYLA and Angel
## Rank 64: EMMA and JOSIAH
## Rank 65: ESTHER and ERIC
## Rank 66: Ava and JASON
## Rank 67: AALIYAH and Muhammad
## <class 'str'>
## <class 'tuple'>
## 143
## 113
## {'Brianna', 'Alexis', 'Cooper', 'Nevaeh', 'Sarah', 'Xavier', 'Brody', 'Blake', 'Alyssa', 'Justin', 'Riley', 'Hailey', 'Taylor', 'Ashley', 'Bentley', 'Kaylee', 'Aaliyah'}

Chapter 2 - Dictionaries

Using dictionaries - “everything in Python is a dictionary” is a common joke:

  • Dictionaries hold values in key/value pairs - the key is often text, while the value can be anything - text, number, container, etc.
  • Dictionaries can be nested within dictionaries, and are iterable as well
  • General process for working with dictionaries
    • Dictionaries are created by dict() or {}
    • myDict[key] = value # general process for adding key/values to the dictionary
    • myDict[fakeKey] # will throw an error if fakeKey is not already a key in the dictionary
    • myDict.get(fakeKey) # will safely return None (or a user-specified default) if the fakeKey is not in the dictionary, and myDict[fakeKey] if it is
  • Additional details on nested data - example of a dictionary “art_galleries” that is keyed by ZIP Code, with values being another dictionary of Gallery (key) - Phone (value)
    • art_galleries.keys() will return all of the keys in art_galleries
    • art_galleries[keyZIP][keyGallery] # returns the phone number of keyGallery in keyZIP
    • Can also provide multiple calls to the .get() method to avoid the “cannot find key” problem
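A brief sketch of safe lookups in a nested dictionary. The ZIP Codes, gallery names, and phone numbers are all hypothetical.

```python
art_galleries = {
    "10021": {"Gallery A": "555-0100", "Gallery B": "555-0101"},
    "10027": {"Gallery C": "555-0102"},
}

zip_codes = list(art_galleries.keys())

# Direct indexing works when both keys are known to exist
phone = art_galleries["10021"]["Gallery B"]

# Indexing with a missing key raises KeyError
try:
    art_galleries["99999"]
    raised = False
except KeyError:
    raised = True

# .get() returns None (or the supplied default) instead of raising
missing = art_galleries.get("99999")

# Chained .get() calls: the {} default keeps the second call from failing
fallback = art_galleries.get("99999", {}).get("Gallery A", "Unknown")

print(zip_codes, phone, raised, missing, fallback)
```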

Altering dictionaries - dictionaries are mutable:

  • Adding key/value pairs to a dictionary
    • Can assign a single key/value just as above: myDict[newKey] = newValue
    • Can extend from another dictionary or from tuples using .update()
    • Suppose that galleries_11234 = [ (“Joe”, 200) , (“Jane”, 300) ]
    • art_galleries[“11234”].update(galleries_11234) # add key/value pairs from the tuples in galleries_11234 nested in the “11234” entry
  • Popping and deleting from dictionaries
    • Can delete a single key using del myDict[delKey] # will throw an error if delKey does not exist in myDict
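    • The .pop() method is a safer way to remove keys from the dictionary - it removes the key and returns its value if the key exists; when a default is supplied, as in myDict.pop(delKey, default), it returns the default instead of throwing an error for a missing key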
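A short sketch of adding to and removing from dictionaries, reusing the galleries_11234 tuples from the notes (the other keys and values are made up):

```python
art_galleries = {"11234": {"Old Gallery": 100}}
galleries_11234 = [("Joe", 200), ("Jane", 300)]

# .update() accepts another dictionary or an iterable of (key, value) tuples
art_galleries["11234"].update(galleries_11234)

# Single assignment adds a new key
art_galleries["10001"] = {"New Gallery": 400}

# del removes a key, and would raise KeyError if the key were missing
del art_galleries["10001"]

# .pop() removes the key and returns its value
removed = art_galleries["11234"].pop("Joe")

# With a default, .pop() on a missing key returns the default
default = art_galleries.pop("99999", {})

# Without a default, .pop() on a missing key raises KeyError
try:
    art_galleries.pop("99999")
    pop_raised = False
except KeyError:
    pop_raised = True

print(art_galleries, removed, default, pop_raised)
```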

Pythonically using dictionaries - efficient means of interacting with dictionaries:

  • The .items() will return an iterable of key-value tuples
  • The in operator is a more efficient and clever way to check whether something exists in a dictionary (as opposed to .get())
    • testKey in myDict # will return True if this is a key and False if it is not
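For illustration, with a made-up dictionary of rank-to-name pairs:

```python
names = {1: "Sophia", 2: "Emma", 3: "Olivia"}

# .items() yields (key, value) tuples that can be unpacked in the loop header
lines = []
for rank, name in names.items():
    lines.append("{} {}".format(rank, name))

# The in operator tests keys, not values
has_rank_2 = 2 in names
has_rank_9 = 9 in names
name_is_key = "Emma" in names

print(lines, has_rank_2, has_rank_9, name_is_key)
```

Because in only looks at keys, checking for a value needs "Emma" in names.values() instead.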

Working with CSV files (comma separated values files) - one of the most common storage systems:

  • Example of reading from a CSV file using a CSV reader - using the “csv” module in Python and the open() function
    • The csv.reader() will read each line of the file as a list of strings, while .close() will then close the file
    • import csv
    • csvFile = open(“myFile.csv”, “r”)
    • for row in csv.reader(csvFile): print(row)
    • csvFile.close()
  • Another option for creating a dictionary from a CSV file is to use DictReader
    • If the data have a header, then that is used
    • If otherwise, then you can pass in the column names
    • for row in csv.DictReader(csvFile): print(row) # each row is an ‘ordered dictionary’ (a regular dict from Python 3.8 onward), a concept explained in more detail in later chapters
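A minimal sketch of both readers. To stay self-contained it reads from an in-memory string via io.StringIO rather than from a file opened with open(); the CSV content is made up.

```python
import csv
import io

csv_text = "name,count\nAnna,1\nTaylor,2\n"

# csv.reader yields each line as a list of strings, header line included
rows = list(csv.reader(io.StringIO(csv_text)))

# csv.DictReader uses the header line as the keys for every later row
dict_rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

# For data without a header line, pass the column names explicitly
no_header = "Anna,1\nTaylor,2\n"
named_rows = [dict(row) for row in
              csv.DictReader(io.StringIO(no_header), fieldnames=["name", "count"])]

print(rows)
print(dict_rows)
print(named_rows)
```

Every value comes back as a string, so numeric columns need an explicit int() or float() conversion.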

Example code includes:


myPath = "./PythonInputFiles/"



# Create top-50 female_baby_names_2012 as list of (name, rank) tuples
import pandas as pd

pd2012 = pd.read_csv(myPath + "yob2012.txt", header=None, names=["Name", "Gender", "Count"])
babyTop = pd2012.loc[pd2012["Gender"] == "F"].sort_values("Count", ascending=False)
female_baby_names_2012 = list(zip(babyTop["Name"][0:50], list(range(1, 51))))



# Create an empty dictionary: names
names = {}

# Loop over the girl names
for name, rank in female_baby_names_2012:
    # Add each name to the names dictionary using rank as the key
    names[rank] = name

# Sort the names list by rank in descending order and slice the first 10 items (popularity 41-50)
for rank in sorted(names, reverse=True)[:10]:
    # Print each item
    print(names[rank])


# Safely print rank 7 from the names dictionary
print(names.get(7))

# Safely print the type of rank 100 from the names dictionary
print(type(names.get(100)))

# Safely print rank 105 from the names dictionary or 'Not Found'
print(names.get(105, "Not Found"))



# Create the boy_names dictionary - start with 2013
pd2013 = pd.read_csv(myPath + "yob2013.txt", header=None, names=["Name", "Gender", "Count"])
boyTop = pd2013.loc[pd2013["Gender"] == "M"].sort_values("Count", ascending=False)
male_baby_names_2013 = list( zip( list(range(1, 51)), boyTop["Name"][0:50] ) )

boyTop = pd2012.loc[pd2012["Gender"] == "M"].sort_values("Count", ascending=False)
male_baby_names_2012 = list( zip( list(range(1, 51)), boyTop["Name"][0:50] ) )

pd2011 = pd.read_csv(myPath + "yob2011.txt", header=None, names=["Name", "Gender", "Count"])
boyTop = pd2011.loc[pd2011["Gender"] == "M"].sort_values("Count", ascending=False)
male_baby_names_2011 = list( zip( list(range(1, 51)), boyTop["Name"][0:50] ) )

pd2014 = pd.read_csv(myPath + "yob2014.txt", header=None, names=["Name", "Gender", "Count"])
boyTop = pd2014.loc[pd2014["Gender"] == "M"].sort_values("Count", ascending=False)
male_baby_names_2014 = list( zip( list(range(1, 51)), boyTop["Name"][0:50] ) )


# male_baby_names_2013 is a dictionary of rank-name, nested in dictionary boy_names with key 2013
boy_names = { 2013 : dict(male_baby_names_2013) , 2012 : dict(male_baby_names_2012) , 2014 : dict(male_baby_names_2014)}


# Print a list of keys from the boy_names dictionary
print(boy_names.keys())

# Print a list of keys from the boy_names dictionary for the year 2013
print(boy_names[2013].keys())

# Loop over the dictionary
for year in boy_names:
    # Safely print the year and the third ranked name or 'Unknown'
    print(year, boy_names[year].get(3, "Unknown"))


# Assign the names_2011 dictionary as the value to the 2011 key of boy_names
boy_names[2011] = dict(male_baby_names_2011)

# Update the 2012 key in the boy_names dictionary
boy_names[2012].update([(1, 'Casey'), (2, 'Aiden')])

# Loop over the boy_names dictionary 
for year in boy_names:
    # Loop over and sort the data for each year by descending rank
    for rank in sorted(boy_names[year], reverse=True)[:1]:
        # Check that you have a rank
        if not rank:
            print(year, 'No Data Available')
        # Safely print the year and the least popular name or 'Not Available'
        print(year, boy_names[year].get(rank))



# Make the female_names dictionary of top-10 names by year
girlTop = pd2013.loc[pd2013["Gender"] == "F"].sort_values("Count", ascending=False)
female_baby_names_2013 = list( zip( list(range(1, 11)), girlTop["Name"][0:10] ) )

girlTop = pd2012.loc[pd2012["Gender"] == "F"].sort_values("Count", ascending=False)
female_baby_names_2012 = list( zip( list(range(1, 11)), girlTop["Name"][0:10] ) )

girlTop = pd2011.loc[pd2011["Gender"] == "F"].sort_values("Count", ascending=False)
female_baby_names_2011 = list( zip( list(range(1, 11)), girlTop["Name"][0:10] ) )

girlTop = pd2014.loc[pd2014["Gender"] == "F"].sort_values("Count", ascending=False)
female_baby_names_2014 = list( zip( list(range(1, 11)), girlTop["Name"][0:10] ) )


# female_names is a nested dictionary: each year key maps to a rank-name dictionary
female_names = { 2013 : dict(female_baby_names_2013) , 2012 : dict(female_baby_names_2012) , 2014 : dict(female_baby_names_2014), 2011: dict(female_baby_names_2011) }


# Remove 2011 and store it: female_names_2011
female_names_2011 = female_names.pop(2011)

# Safely remove 2015 with an empty dictionary as the default and store it: female_names_2015
female_names_2015 = female_names.pop(2015, {})

# Delete 2012
del female_names[2012]

# Print female_names
print(female_names)


# Iterate over the 2014 nested dictionary
for rank, name in female_names[2014].items():
    # Print rank and name
    print(rank, name)

# Iterate over the 2013 nested dictionary
for rank, name in female_names[2013].items():
    # Print rank and name
    print(rank, name)


# Check to see if 2011 is in female_names
if 2011 in female_names:
    # Print 'Found 2011'
    print('Found 2011')

# Check to see if rank 1 is in 2013
if 1 in female_names[2013]:
    # Print 'Found Rank 1 in 2013' if found
    print('Found Rank 1 in 2013')
else:
    # Print 'Rank 1 missing from 2013' if not found
    print('Rank 1 missing from 2013')

# Check to see if Rank 100 is in 2013
if 100 in female_names[2013]:
    print('Found Rank 100')
else:
    print('Rank 100 missing from 2013')


# Created top10 female names for 2013 as Year - "F" - "NA" - Name - "NA" - Rank
# topFemale = female_baby_names_2013
# rankData = [a for a, b in topFemale]
# nameData = [b for a, b in topFemale]
# babyData = pd.DataFrame( {"YEAR": 2013, "GENDER": "F", "FILL1": "NA", "NAME": nameData, "FILL2": "NA", "RANK": rankData} )[["YEAR", "GENDER", "FILL1", "NAME", "FILL2", "RANK"]]
# babyData.to_csv(myPath + "baby_names.csv", index=False)


# Import the python CSV module
import csv

# Create a python file object in read mode for the baby_names.csv file: csvfile
csvfile = open(myPath + "baby_names.csv", "r")

baby_names = {}

# Loop over a csv reader on the file object
for row in csv.reader(csvfile):
    # Print each row 
    print(row)
    # Add the rank and name to the dictionary
    if row[5] != "RANK": 
        baby_names[int(row[5])] = row[3]

# Print the dictionary keys
print(baby_names.keys())


# Create a python file object in read mode for the `baby_names.csv` file: csvfile
csvfile = open(myPath + "baby_names.csv", "r")

baby_names = {}

# Loop over a DictReader on the file
for row in csv.DictReader(csvfile):
    # Print each row 
    print(row)
    # Add the rank and name to the dictionary: baby_names
    baby_names[int(row["RANK"])] = row["NAME"]

# Print the dictionary keys
print(baby_names.keys())
## Ashley
## Arianna
## Camila
## Riley
## Taylor
## Claire
## Alyssa
## Sarah
## Savannah
## Audrey
## Abigail
## <class 'NoneType'>
## Not Found
## dict_keys([2013, 2012, 2014])
## dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])
## 2013 Liam
## 2012 Ethan
## 2014 Mason
## 2013 Levi
## 2012 Tyler
## 2014 Aaron
## 2011 Julian
## {2013: {1: 'Sophia', 2: 'Emma', 3: 'Olivia', 4: 'Isabella', 5: 'Ava', 6: 'Mia', 7: 'Emily', 8: 'Abigail', 9: 'Madison', 10: 'Elizabeth'}, 2014: {1: 'Emma', 2: 'Olivia', 3: 'Sophia', 4: 'Isabella', 5: 'Ava', 6: 'Mia', 7: 'Emily', 8: 'Abigail', 9: 'Madison', 10: 'Charlotte'}}
## 1 Emma
## 2 Olivia
## 3 Sophia
## 4 Isabella
## 5 Ava
## 6 Mia
## 7 Emily
## 8 Abigail
## 9 Madison
## 10 Charlotte
## 1 Sophia
## 2 Emma
## 3 Olivia
## 4 Isabella
## 5 Ava
## 6 Mia
## 7 Emily
## 8 Abigail
## 9 Madison
## 10 Elizabeth
## Found Rank 1 in 2013
## Rank 100 missing from 2013
## ['YEAR', 'GENDER', 'FILL1', 'NAME', 'FILL2', 'RANK']
## ['2013', 'F', 'NA', 'Sophia', 'NA', '1']
## ['2013', 'F', 'NA', 'Emma', 'NA', '2']
## ['2013', 'F', 'NA', 'Olivia', 'NA', '3']
## ['2013', 'F', 'NA', 'Isabella', 'NA', '4']
## ['2013', 'F', 'NA', 'Ava', 'NA', '5']
## ['2013', 'F', 'NA', 'Mia', 'NA', '6']
## ['2013', 'F', 'NA', 'Emily', 'NA', '7']
## ['2013', 'F', 'NA', 'Abigail', 'NA', '8']
## ['2013', 'F', 'NA', 'Madison', 'NA', '9']
## ['2013', 'F', 'NA', 'Elizabeth', 'NA', '10']
## dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
## OrderedDict([('YEAR', '2013'), ('GENDER', 'F'), ('FILL1', 'NA'), ('NAME', 'Sophia'), ('FILL2', 'NA'), ('RANK', '1')])
## OrderedDict([('YEAR', '2013'), ('GENDER', 'F'), ('FILL1', 'NA'), ('NAME', 'Emma'), ('FILL2', 'NA'), ('RANK', '2')])
## OrderedDict([('YEAR', '2013'), ('GENDER', 'F'), ('FILL1', 'NA'), ('NAME', 'Olivia'), ('FILL2', 'NA'), ('RANK', '3')])
## OrderedDict([('YEAR', '2013'), ('GENDER', 'F'), ('FILL1', 'NA'), ('NAME', 'Isabella'), ('FILL2', 'NA'), ('RANK', '4')])
## OrderedDict([('YEAR', '2013'), ('GENDER', 'F'), ('FILL1', 'NA'), ('NAME', 'Ava'), ('FILL2', 'NA'), ('RANK', '5')])
## OrderedDict([('YEAR', '2013'), ('GENDER', 'F'), ('FILL1', 'NA'), ('NAME', 'Mia'), ('FILL2', 'NA'), ('RANK', '6')])
## OrderedDict([('YEAR', '2013'), ('GENDER', 'F'), ('FILL1', 'NA'), ('NAME', 'Emily'), ('FILL2', 'NA'), ('RANK', '7')])
## OrderedDict([('YEAR', '2013'), ('GENDER', 'F'), ('FILL1', 'NA'), ('NAME', 'Abigail'), ('FILL2', 'NA'), ('RANK', '8')])
## OrderedDict([('YEAR', '2013'), ('GENDER', 'F'), ('FILL1', 'NA'), ('NAME', 'Madison'), ('FILL2', 'NA'), ('RANK', '9')])
## OrderedDict([('YEAR', '2013'), ('GENDER', 'F'), ('FILL1', 'NA'), ('NAME', 'Elizabeth'), ('FILL2', 'NA'), ('RANK', '10')])
## dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
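
The safe-access patterns used in the code above (.get(), .pop() with a default, and the in operator) can be sketched on a much smaller dictionary. The data below is a hypothetical three-rank stand-in for female_names, not the full data set:

```python
# Hypothetical rank-to-name dictionaries keyed by year (a tiny stand-in for female_names)
female_names = {
    2013: {1: "Sophia", 2: "Emma", 3: "Olivia"},
    2014: {1: "Emma", 2: "Olivia", 3: "Sophia"},
}

# .get() returns None (or a supplied default) instead of raising a KeyError
missing_rank = female_names[2013].get(100)
fallback = female_names[2013].get(100, "Not Found")

# .pop() removes a key and returns its value; the default avoids a KeyError
names_2014 = female_names.pop(2014)
names_2015 = female_names.pop(2015, {})

# Membership tests check the keys, not the values
has_2014 = 2014 in female_names
has_rank_1 = 1 in female_names[2013]

print(missing_rank, fallback, names_2015, has_2014, has_rank_1)
```

Using a default in .get() and .pop() keeps the lookup on one line, whereas del and square-bracket access raise a KeyError when the key is missing.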

Chapter 3 - Collections Module

Counting made easy - collections module (advanced data containers; part of Standard Library):

  • Counter - special dictionary used for counting data (frequency)
    • from collections import Counter
    • nyc_eatery_by_type = Counter(nyc_eatery) # column of nyc_eatery is type
    • nyc_eatery_by_type.most_common(3) # the three most common eatery types
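
As a minimal sketch of the Counter workflow above, the list below is hypothetical sample data standing in for the eatery-type column:

```python
from collections import Counter

# Hypothetical eatery types (an assumption, standing in for the nyc_eatery type column)
eatery_types = ["Food Cart", "Restaurant", "Food Cart",
                "Snack Bar", "Food Cart", "Restaurant"]

# Counter tallies how often each value appears
eatery_counts = Counter(eatery_types)

# most_common(n) returns the n most frequent (value, count) pairs
top_two = eatery_counts.most_common(2)
print(top_two)

# A key that was never seen counts as 0 rather than raising a KeyError
print(eatery_counts["Ice Cream Cart"])
```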

Dictionaries of unknown structure - default dictionaries:

  • Often, the goal is a dictionary with many keys, each holding a (potentially lengthy) list as its value
  • With a plain dictionary, avoiding a KeyError typically requires first creating each key with an empty list as its value, then calling myDict[key].append(myNewData) each time
  • The defaultdict works just like a dictionary, except that accessing a key that does not already exist creates it with a default value built from the type passed at creation; this saves the existence check
    • from collections import defaultdict
    • eateries_by_park = defaultdict(list) # without the list argument, a missing key would still raise a KeyError
    • for park_id, name in nyc_eateries_parks: eateries_by_park[park_id].append(name)
  • Can also pass a different default type if lists are not desired
    • eatery_contact_types = defaultdict(int) # this will then allow the += 1 and related commands
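
A short sketch of both default types described above; the (park_id, name) pairs are hypothetical sample data:

```python
from collections import defaultdict

# Hypothetical (park_id, name) pairs (an assumption for illustration)
nyc_eateries_parks = [("M010", "Cafe A"), ("B057", "Cart B"), ("M010", "Cart C")]

# Default type list: a missing key starts as an empty list, so append always works
eateries_by_park = defaultdict(list)
for park_id, name in nyc_eateries_parks:
    eateries_by_park[park_id].append(name)

# Default type int: a missing key starts at 0, so += 1 always works
eateries_per_park = defaultdict(int)
for park_id, _ in nyc_eateries_parks:
    eateries_per_park[park_id] += 1

print(dict(eateries_by_park))
print(dict(eateries_per_park))
```

Note that merely reading a missing key from a defaultdict also creates it, which can matter when checking membership afterwards.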

Maintaining dictionary order with OrderedDict:

  • Whether Python dictionaries keep insertion order depends on version - CPython 3.6 preserves insertion order as an implementation detail, and Python 3.7 made it a language guarantee
  • However, even in older versions of Python, this feature was available from the “collections” module
    • from collections import OrderedDict
    • nyc_eatery_permits = OrderedDict()
    • for eatery in nyc_eateries: nyc_eatery_permits[eatery[“end_date”]] = eatery
  • By using .popitem() on an ordered dictionary, you get items back from latest (last) to earliest (first)
    • Alternately, .popitem(last=False) will pull back the items from earliest (first) to latest (last)
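
The two directions of .popitem() can be sketched as follows; the permit end dates and names are hypothetical:

```python
from collections import OrderedDict

# Hypothetical (end_date, name) permit records, inserted from earliest to latest
permit_records = [("2016-01-31", "Cafe A"), ("2017-06-30", "Cart B"), ("2018-12-31", "Cart C")]

permits = OrderedDict()
for end_date, name in permit_records:
    permits[end_date] = name

# Default popitem() removes and returns the most recently inserted item
latest = permits.popitem()

# popitem(last=False) removes and returns the earliest inserted item
earliest = permits.popitem(last=False)

print(latest, earliest, list(permits))
```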

Class and Namedtuple - a namedtuple is a tuple where each position has a name:

  • Creating a namedtuple involves passing a name and a list of fields
    • from collections import namedtuple
    • Eatery = namedtuple(“Eatery”, [“name”, “location”, “park_id”, “type_name”])
    • eateries = []
    • for eatery in nyc_eateries:
      • details = Eatery(eatery[“name”], eatery[“location”], eatery[“park_id”], eatery[“type_name”])
      • eateries.append(details)
  • The namedtuple can make the code cleaner, since each field is available as an attribute of the namedtuple
    • The names are available as tuple attributes - for example, myTuple.age will pull the field “age” from myTuple
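
A runnable sketch of the namedtuple pattern above; the single eatery record is hypothetical sample data:

```python
from collections import namedtuple

# Each position in the tuple gets a field name
Eatery = namedtuple("Eatery", ["name", "location", "park_id", "type_name"])

# Hypothetical source records (an assumption standing in for nyc_eateries)
nyc_eateries = [
    {"name": "Cafe A", "location": "Main Lawn", "park_id": "M010", "type_name": "Restaurant"},
]

eateries = []
for eatery in nyc_eateries:
    details = Eatery(eatery["name"], eatery["location"], eatery["park_id"], eatery["type_name"])
    eateries.append(details)

first = eateries[0]
# Fields are available by name as attributes, and still by position like a regular tuple
print(first.name, first.park_id, first[3])
```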

Example code includes:


myPath = "./PythonInputFiles/"



# Create stations data from the CSV downloaded from Chicago Open Data
# https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Daily-Totals/5neh-572f/data
# Filtered the data to download only 2015-2016
import pandas as pd



statRaw = pd.read_csv(myPath + "CTA_Ridership_Station_Entries_Daily_Totals.csv")
statRaw.head()
len(statRaw["stationname"].value_counts())

# In the course, stations was a list of length 100801 of CTA stations (700 each of 144 stations, plus "station_name")
# Here it instead holds 731 entries per station (one for each day in 2015 and 2016)
stations = list(statRaw["stationname"])



# Import the Counter object
from collections import Counter

# Print the first ten items from the stations list
print(stations[:10])

# Create a Counter of the stations list: station_count
station_count = Counter(stations)

# Print the station_count
print(station_count)


# Create a Counter of the stations list: station_count
station_count = Counter(stations)

# Find the 5 most common elements
print(station_count.most_common(5))



# Create entries as an iterator of tuples that can be unpacked to date-stop-riders
# miniStat = statRaw.iloc[0:100, :]
entries = zip(statRaw["date"], statRaw["stationname"], statRaw["rides"])


# Create an empty dictionary: ridership
ridership = {}

# Iterate over the entries
for date, stop, riders in entries:
    # Check to see if date is already in the dictionary
    if date not in ridership:
        # Create an empty list for any missing date
        ridership[date] = []
    # Append the stop and riders as a tuple to the date keys list
    ridership[date].append((stop, riders))

# Print the ridership for '03/09/2016'
print(ridership["03/09/2016"])


# Import defaultdict
from collections import defaultdict

# Create a defaultdict with a default type of list: ridership
ridership = defaultdict(list)


# Need to re-create the iterator - a zip object is exhausted once consumed above!
entries = zip(statRaw["date"], statRaw["stationname"], statRaw["rides"])


# Iterate over the entries
for date, stop, riders in entries:
    # Use the stop as the key of ridership and append the riders to its value
    ridership[stop].append(riders)

# Print the first 10 items of the ridership dictionary
# print(list(ridership.items())[:10])  # a spectacularly bad idea due to length!
[(a, len(x), sum(x)) for a, x in list(ridership.items())[:10]]  # just to get a sense for the data



# Import OrderedDict from collections
from collections import OrderedDict

# Create an OrderedDict called: ridership_date
ridership_date = OrderedDict()


# Need to re-create the iterator - only want date and riders this time!
entries = zip(statRaw["date"], statRaw["rides"])


# Iterate over the entries
for date, riders in entries:
    # If a key does not exist in ridership_date, set it to 0
    if date not in ridership_date:
        ridership_date[date] = 0
    # Add riders to the date key in ridership_date
    ridership_date[date] += riders

# Print the first 31 records
print(list(ridership_date.items())[:31])


# Print the first key in ridership_date
print(list(ridership_date.keys())[0])

# Pop the first item from ridership_date and print it
print(ridership_date.popitem(last=False))

# Print the last key in ridership_date
print(list(ridership_date.keys())[-1])

# Pop the last item from ridership_date and print it
print(ridership_date.popitem())


# Import namedtuple from collections
from collections import namedtuple

# Create the namedtuple: DateDetails
DateDetails = namedtuple('DateDetails', ['date', 'stop', 'riders'])

# Create the empty list: labeled_entries
labeled_entries = []


# Need to re-create the iterator - a zip object is exhausted once consumed above!
entries = zip(statRaw["date"], statRaw["stationname"], statRaw["rides"])


# Iterate over the entries
for date, stop, riders in entries:
    # Append a new DateDetails namedtuple instance for each entry to labeled_entries
    labeled_entries.append(DateDetails(date, stop, riders))

# Print the first 5 items in labeled_entries
print(labeled_entries[:5])


# Iterate over the first twenty items in labeled_entries
for item in labeled_entries[:20]:
    # Print each item's date, riders, and stop
    print(item.date, item.riders, item.stop)
## ['Austin-Forest Park', 'Harlem-Lake', 'Pulaski-Lake', 'Quincy/Wells', 'Davis', "Belmont-O'Hare", 'Jackson/Dearborn', 'Sheridan', 'Damen-Brown', 'Morse']
## Counter({'Austin-Forest Park': 731, 'Harlem-Lake': 731, 'Pulaski-Lake': 731, 'Quincy/Wells': 731, 'Davis': 731, "Belmont-O'Hare": 731, 'Jackson/Dearborn': 731, 'Sheridan': 731, 'Damen-Brown': 731, 'Morse': 731, '35th/Archer': 731, '51st': 731, 'Dempster-Skokie': 731, 'Pulaski-Cermak': 731, 'LaSalle/Van Buren': 731, 'Ashland-Lake': 731, 'Oak Park-Forest Park': 731, 'Sox-35th-Dan Ryan': 731, 'Randolph/Wabash': 731, 'Damen-Cermak': 731, 'Western-Forest Park': 731, 'Cumberland': 731, '79th': 731, 'Kedzie-Homan-Forest Park': 731, 'State/Lake': 731, 'Main': 731, 'Central-Lake': 731, 'Ashland/63rd': 731, 'Indiana': 731, 'Western-Orange': 731, 'Division/Milwaukee': 731, 'Grand/State': 731, 'Berwyn': 731, 'UIC-Halsted': 731, 'Southport': 731, 'Washington/Dearborn': 731, 'Clark/Lake': 731, 'Forest Park': 731, 'Noyes': 731, 'Cicero-Cermak': 731, 'Clinton-Forest Park': 731, 'California-Cermak': 731, '95th/Dan Ryan': 731, 'Merchandise Mart': 731, 'Racine': 731, 'Cicero-Lake': 731, 'Grand/Milwaukee': 731, 'Garfield-South Elevated': 731, 'Foster': 731, 'Diversey': 731, 'Wilson': 731, "Irving Park-O'Hare": 731, 'Jackson/State': 731, 'California/Milwaukee': 731, '54th/Cermak': 731, 'Damen/Milwaukee': 731, 'Kostner': 731, 'Ridgeland': 731, 'Clark/Division': 731, 'Madison/Wabash': 731, 'North/Clybourn': 731, 'Armitage': 731, 'Western/Milwaukee': 731, 'Adams/Wabash': 731, 'Dempster': 731, 'Laramie': 731, 'Chicago/Franklin': 731, 'East 63rd-Cottage Grove': 731, 'Washington/Wells': 731, 'Western-Cermak': 731, "Harlem-O'Hare": 731, 'Granville': 731, 'Lawrence': 731, 'Central Park': 731, 'Monroe/Dearborn': 731, 'Sedgwick': 731, 'Medical Center': 731, 'Rosemont': 731, '18th': 731, 'South Boulevard': 731, 'Library': 731, 'Francisco': 731, 'Thorndale': 731, "O'Hare Airport": 731, 'Howard': 731, '63rd-Dan Ryan': 731, 'Pulaski-Forest Park': 731, 'Midway Airport': 731, 'Halsted/63rd': 731, 'Pulaski-Orange': 731, 'Cicero-Forest Park': 731, 'Harlem-Forest Park': 731, '69th': 731, 
'Cermak-Chinatown': 731, 'Rockwell': 731, 'Logan Square': 731, 'Polk': 731, 'Kedzie-Cermak': 731, 'Linden': 731, 'Ashland-Orange': 731, 'Kedzie-Lake': 731, '47th-South Elevated': 731, 'Monroe/State': 731, '35-Bronzeville-IIT': 731, 'Halsted-Orange': 731, 'King Drive': 731, 'Kedzie-Midway': 731, 'Clinton-Lake': 731, 'Garfield-Dan Ryan': 731, 'Kedzie-Brown': 731, 'Jarvis': 731, 'Argyle': 731, 'Wellington': 731, 'Fullerton': 731, '47th-Dan Ryan': 731, "Addison-O'Hare": 731, 'Central-Evanston': 731, 'Austin-Lake': 731, '43rd': 731, 'Jefferson Park': 731, 'Kimball': 731, 'Loyola': 731, 'Paulina': 731, 'Belmont-North Main': 731, "Montrose-O'Hare": 731, 'LaSalle': 731, 'Oak Park-Lake': 731, 'California-Lake': 731, 'Bryn Mawr': 731, 'Roosevelt': 731, 'Chicago/Milwaukee': 731, 'Addison-North Main': 731, '87th': 731, 'Addison-Brown': 731, 'Chicago/State': 731, 'Irving Park-Brown': 731, 'Western-Brown': 731, 'Harrison': 731, 'Montrose-Brown': 731, 'Morgan-Lake': 731, 'Lake/State': 731, 'Conservatory': 731, 'Oakton-Skokie': 731, 'Cermak-McCormick Place': 731})
## [('Austin-Forest Park', 731), ('Harlem-Lake', 731), ('Pulaski-Lake', 731), ('Quincy/Wells', 731), ('Davis', 731)]
## [('Austin-Forest Park', 2128), ('Harlem-Lake', 3769), ('Pulaski-Lake', 1502), ('Quincy/Wells', 8139), ('Davis', 3656), ("Belmont-O'Hare", 5294), ('Jackson/Dearborn', 8369), ('Sheridan', 5823), ('Damen-Brown', 3048), ('Morse', 4826), ('35th/Archer', 3450), ('51st', 1033), ('Dempster-Skokie', 1697), ('Pulaski-Cermak', 1259), ('LaSalle/Van Buren', 3104), ('Ashland-Lake', 2486), ('Oak Park-Forest Park', 1882), ('Sox-35th-Dan Ryan', 4967), ('Randolph/Wabash', 9659), ('Damen-Cermak', 1572), ('Western-Forest Park', 1819), ('Cumberland', 4589), ('79th', 7476), ('Kedzie-Homan-Forest Park', 2256), ('State/Lake', 10594), ('Main', 1129), ('Central-Lake', 2145), ('Ashland/63rd', 1302), ('Indiana', 919), ('Western-Orange', 3958), ('Division/Milwaukee', 6580), ('Grand/State', 10949), ('Berwyn', 3539), ('UIC-Halsted', 7523), ('Southport', 3467), ('Washington/Dearborn', 12365), ('Clark/Lake', 21640), ('Forest Park', 3636), ('Noyes', 941), ('Cicero-Cermak', 1271), ('Clinton-Forest Park', 4016), ('California-Cermak', 1627), ('95th/Dan Ryan', 11509), ('Merchandise Mart', 8345), ('Racine', 2598), ('Cicero-Lake', 1485), ('Grand/Milwaukee', 2851), ('Garfield-South Elevated', 1413), ('Foster', 963), ('Diversey', 5771), ('Wilson', 6470), ("Irving Park-O'Hare", 4808), ('Jackson/State', 12445), ('California/Milwaukee', 5413), ('54th/Cermak', 2170), ('Damen/Milwaukee', 7022), ('Kostner', 556), ('Ridgeland', 1353), ('Clark/Division', 8216), ('Madison/Wabash', 0), ('North/Clybourn', 6360), ('Armitage', 4575), ('Western/Milwaukee', 5511), ('Adams/Wabash', 9666), ('Dempster', 788), ('Laramie', 1328), ('Chicago/Franklin', 6868), ('East 63rd-Cottage Grove', 1135), ('Washington/Wells', 8267), ('Western-Cermak', 1182), ("Harlem-O'Hare", 3202), ('Granville', 3762), ('Lawrence', 3355), ('Central Park', 1342), ('Monroe/Dearborn', 7972), ('Sedgwick', 4004), ('Medical Center', 3581), ('Rosemont', 6101), ('18th', 2028), ('South Boulevard', 813), ('Library', 4127), ('Francisco', 1617), ('Thorndale', 
3355), ("O'Hare Airport", 9742), ('Howard', 5935), ('63rd-Dan Ryan', 3500), ('Pulaski-Forest Park', 2110), ('Midway Airport', 8698), ('Halsted/63rd', 839), ('Pulaski-Orange', 5663), ('Cicero-Forest Park', 1475), ('Harlem-Forest Park', 1185), ('69th', 5790), ('Cermak-Chinatown', 4312), ('Rockwell', 1996), ('Logan Square', 7536), ('Polk', 3750), ('Kedzie-Cermak', 1181), ('Linden', 817), ('Ashland-Orange', 1637), ('Kedzie-Lake', 1753), ('47th-South Elevated', 1347), ('Monroe/State', 11264), ('35-Bronzeville-IIT', 1901), ('Halsted-Orange', 3162), ('King Drive', 651), ('Kedzie-Midway', 3552), ('Clinton-Lake', 4278), ('Garfield-Dan Ryan', 3676), ('Kedzie-Brown', 2039), ('Jarvis', 1817), ('Argyle', 3152), ('Wellington', 3242), ('Fullerton', 15150), ('47th-Dan Ryan', 3331), ("Addison-O'Hare", 3563), ('Central-Evanston', 802), ('Austin-Lake', 1994), ('43rd', 1090), ('Jefferson Park', 7112), ('Kimball', 4236), ('Loyola', 4712), ('Paulina', 2895), ('Belmont-North Main', 12936), ("Montrose-O'Hare", 2529), ('LaSalle', 3556), ('Oak Park-Lake', 1561), ('California-Lake', 1125), ('Bryn Mawr', 4888), ('Roosevelt', 11055), ('Chicago/Milwaukee', 4605), ('Addison-North Main', 6719), ('87th', 4473), ('Addison-Brown', 2754), ('Chicago/State', 13946), ('Irving Park-Brown', 3268), ('Western-Brown', 4273), ('Harrison', 4750), ('Montrose-Brown', 2875), ('Morgan-Lake', 2700), ('Lake/State', 21708), ('Conservatory', 999), ('Oakton-Skokie', 839), ('Cermak-McCormick Place', 1208)]
## [('01/01/2015', 233956), ('01/02/2015', 432144), ('01/03/2015', 273207), ('01/04/2015', 217632), ('01/05/2015', 538868), ('01/06/2015', 556918), ('01/07/2015', 416984), ('01/08/2015', 475074), ('01/09/2015', 524144), ('01/10/2015', 282850), ('01/11/2015', 227240), ('01/12/2015', 605068), ('01/13/2015', 609226), ('01/14/2015', 608109), ('01/15/2015', 622792), ('01/16/2015', 612833), ('01/17/2015', 335555), ('01/18/2015', 244490), ('01/19/2015', 411497), ('01/20/2015', 618377), ('01/21/2015', 619945), ('01/22/2015', 623914), ('01/23/2015', 612177), ('01/24/2015', 333440), ('01/25/2015', 226964), ('01/26/2015', 605287), ('01/27/2015', 626168), ('01/28/2015', 625531), ('01/29/2015', 622695), ('01/30/2015', 618395), ('01/31/2015', 337018)]
## 01/01/2015
## ('01/01/2015', 233956)
## 12/31/2016
## ('12/31/2016', 295002)
## [DateDetails(date='01/01/2015', stop='Austin-Forest Park', riders=587), DateDetails(date='01/01/2015', stop='Harlem-Lake', riders=1106), DateDetails(date='01/01/2015', stop='Pulaski-Lake', riders=811), DateDetails(date='01/01/2015', stop='Quincy/Wells', riders=1117), DateDetails(date='01/01/2015', stop='Davis', riders=1400)]
## 01/01/2015 587 Austin-Forest Park
## 01/01/2015 1106 Harlem-Lake
## 01/01/2015 811 Pulaski-Lake
## 01/01/2015 1117 Quincy/Wells
## 01/01/2015 1400 Davis
## 01/01/2015 2023 Belmont-O'Hare
## 01/01/2015 1730 Jackson/Dearborn
## 01/01/2015 2616 Sheridan
## 01/01/2015 751 Damen-Brown
## 01/01/2015 2433 Morse
## 01/01/2015 862 35th/Archer
## 01/01/2015 430 51st
## 01/01/2015 542 Dempster-Skokie
## 01/01/2015 491 Pulaski-Cermak
## 01/01/2015 270 LaSalle/Van Buren
## 01/01/2015 833 Ashland-Lake
## 01/01/2015 416 Oak Park-Forest Park
## 01/01/2015 1862 Sox-35th-Dan Ryan
## 01/01/2015 2267 Randolph/Wabash
## 01/01/2015 451 Damen-Cermak

Chapter 4 - Handling Dates and Times

DateTime journey - leap years, different length months, time zones, holidays, etc.:

  • The datetime module in Python is part of the standard library (there is also a datetime type inside the datetime module)
  • Parsing existing strings into datetime objects is accomplished using .strptime()
    • from datetime import datetime
    • parking_violations_date = “06/11/2016”
    • date_dt = datetime.strptime(parking_violations_date, “%m/%d/%Y”)
  • Time format strings are common across many programming languages, and originated in C
  • Converting an existing datetime object to a string is accomplished using .strftime()
    • date_dt.strftime(“%m/%d/%Y”)
  • The .isoformat() method outputs a datetime as an ISO standard string
    • date_dt.isoformat()
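
The parse/format round trip above can be sketched as follows, reusing the date string from the notes:

```python
from datetime import datetime

parking_violations_date = "06/11/2016"

# strptime: string -> datetime, using a format string that matches the input
date_dt = datetime.strptime(parking_violations_date, "%m/%d/%Y")

# strftime: datetime -> string, here back to the original layout
round_trip = date_dt.strftime("%m/%d/%Y")

# isoformat: datetime -> ISO 8601 string
iso_string = date_dt.isoformat()

print(date_dt, round_trip, iso_string)
```

If the format string does not match the input, .strptime() raises a ValueError, so the format must describe the string exactly.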

Working with DateTime components and current time:

  • All the parts of a datetime object are available as attributes - day, month, year, hour, minute, second, and more - great for grouping data
    • daily_violations = defaultdict(int)
    • for violation in parking_violations:
      • violation_date = datetime.strptime(violation[4], “%m/%d/%Y”)
      • daily_violations[violation_date.day] += 1
    • print(sorted(daily_violations.items()))
  • Can grab the current time using .now() for the local time zone and .utcnow() for the current UTC time
  • Datetime objects can be “naïve” (unaware of timezones) or “aware” (timezone encoded in the object)
    • An “aware” datetime object can be converted to other timezones with its .astimezone() method
    • Timezone data is available in the pytz module via the timezone object
    • ny_dt = myNaive.replace(tzinfo=timezone(“US/Eastern”))
    • la_dt = ny_dt.astimezone(timezone(“US/Pacific”))
    • Note that pytz documents timezone(“US/Eastern”).localize(myNaive) as the reliable way to attach a pytz timezone, since .replace(tzinfo=...) can pick up a local-mean-time offset for some zones
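
A minimal sketch of grouping by a datetime component and of converting between timezones. To stay within the standard library, it swaps in fixed-offset datetime.timezone objects rather than pytz; the offsets (-5 and -8 hours) and the sample dates are assumptions for illustration and ignore daylight saving time:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical violation date strings (an assumption for illustration)
violation_dates = ["06/11/2016", "06/11/2016", "07/04/2016"]

# Group by the month attribute of each parsed datetime
monthly_violations = defaultdict(int)
for date_str in violation_dates:
    violation_dt = datetime.strptime(date_str, "%m/%d/%Y")
    monthly_violations[violation_dt.month] += 1
print(sorted(monthly_violations.items()))

# Fixed-offset stand-ins for US Eastern and US Pacific standard time (no DST handling)
eastern = timezone(timedelta(hours=-5), "EST")
pacific = timezone(timedelta(hours=-8), "PST")

naive_dt = datetime(2016, 1, 15, 12, 0)     # no tzinfo: naive
ny_dt = naive_dt.replace(tzinfo=eastern)    # same wall-clock time, now aware
la_dt = ny_dt.astimezone(pacific)           # same instant, different wall-clock time

print(ny_dt.isoformat(), la_dt.isoformat())
```

With fixed offsets .replace(tzinfo=...) is safe; for real-world zones with daylight saving rules, a timezone database (such as pytz, as used in these notes) is needed.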

Adding and subtracting time - the timedelta object:

  • The timedelta object (once created) can be added or subtracted from any other datetime object
    • from datetime import timedelta
    • flashback = timedelta(days=90)
    • print(record_dt - flashback, record_dt + flashback)
  • Can also get the timedelta between two objects as the return value
    • time_diff = myTimeA - myTimeB
    • type(time_diff) will be timedelta
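
The timedelta arithmetic above can be sketched as follows; the value of record_dt is a hypothetical example date:

```python
from datetime import datetime, timedelta

# Hypothetical record date (an assumption for illustration)
record_dt = datetime(2016, 6, 11)
flashback = timedelta(days=90)

# Adding or subtracting a timedelta yields another datetime
earlier = record_dt - flashback
later = record_dt + flashback

# Subtracting two datetimes yields a timedelta
time_diff = later - earlier

print(earlier, later, time_diff.days)
```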

Libraries to simplify this process:

  • Parsing time with pendulum - can just use pendulum.parse(“dateString”, tz=“US/Eastern”) and it will attempt to parse a datetime
  • The pendulum module also has strong support for timezone hopping
    • The .in_timezone() method converts a pendulum time object to a desired timezone
    • The .now() method accepts a timezone you want to get the current time for
  • The pendulum module also helps to “humanize” time differences
    • .in_words() provides the difference in a more human-readable manner
    • .in_days() will show the difference in days

Example code includes:


myPath = "./PythonInputFiles/"



from collections import defaultdict

dates_list = ['02/19/2001', '04/10/2001', '05/30/2001', '07/19/2001', '09/07/2001', '10/27/2001', '12/16/2001', '02/04/2002', '03/26/2002', '05/15/2002', '07/04/2002', '08/23/2002', '10/12/2002', '12/01/2002', '01/20/2003', '03/11/2003', '04/30/2003', '06/19/2003', '08/08/2003', '09/27/2003', '11/16/2003', '01/05/2004', '02/24/2004', '04/14/2004', '06/03/2004', '07/23/2004', '09/11/2004', '10/31/2004', '12/20/2004', '02/08/2005', '03/30/2005', '05/19/2005', '07/08/2005', '08/27/2005', '10/16/2005', '12/05/2005', '01/24/2006', '03/15/2006', '05/04/2006', '06/23/2006', '08/12/2006', '10/01/2006', '11/20/2006', '01/09/2007', '02/28/2007', '04/19/2007', '06/08/2007', '07/28/2007', '09/16/2007', '11/05/2007', '12/25/2007', '02/13/2008', '04/03/2008', '05/23/2008', '07/12/2008', '08/31/2008', '10/20/2008', '12/09/2008', '01/28/2009', '03/19/2009', '05/08/2009', '06/27/2009', '08/16/2009', '10/05/2009', '11/24/2009', '01/13/2010', '03/04/2010', '04/23/2010', '06/12/2010', '08/01/2010', '09/20/2010', '11/09/2010', '12/29/2010', '02/17/2011', '04/08/2011', '05/28/2011', '07/17/2011', '09/05/2011', '10/24/2011', '11/12/2011', '01/01/2012', '02/20/2012', '04/10/2012', '05/30/2012', '07/19/2012', '09/07/2012', '10/27/2012', '12/16/2012', '02/04/2013', '03/26/2013', '05/15/2013', '07/04/2013', '08/23/2013', '10/12/2013', '12/01/2013', '01/20/2014', '03/11/2014', '04/30/2014', '06/19/2014', '08/08/2014', '09/27/2014', '11/16/2014', '07/05/2014', '01/24/2015', '03/15/2015', '05/04/2015', '06/23/2015', '08/12/2015', '10/01/2015', '11/20/2015', '01/09/2016', '02/28/2016', '04/18/2016', '06/07/2016', '07/27/2016', '09/15/2016', '11/04/2016']

# Import the datetime object from datetime
from datetime import datetime

# Iterate over the dates_list 
for date_str in dates_list:
    # Convert each date to a datetime object: date_dt
    date_dt = datetime.strptime(date_str, "%m/%d/%Y")
    
    # Print each date_dt
    print(date_dt)


datetimes_list = [datetime(2001, 2, 19, 0, 0), datetime(2001, 4, 10, 0, 0), datetime(2001, 5, 30, 0, 0), datetime(2001, 7, 19, 0, 0), datetime(2001, 9, 7, 0, 0), datetime(2001, 10, 27, 0, 0), datetime(2001, 12, 16, 0, 0), datetime(2002, 2, 4, 0, 0), datetime(2002, 3, 26, 0, 0), datetime(2002, 5, 15, 0, 0)]

# Loop over datetimes_list
for item in datetimes_list:
    # Print out the record as a string in the format of 'MM/DD/YYYY'
    print(item.strftime('%m/%d/%Y'))
    
    # Print out the record as an ISO standard string
    print(item.isoformat())



# Create stations data from the CSV downloaded from Chicago Open Data
# https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Daily-Totals/5neh-572f/data
# Filtered the data to download only 2015-2016
import pandas as pd

statRaw = pd.read_csv(myPath + "CTA_Ridership_Station_Entries_Daily_Totals.csv")
statRaw.head()

# mock up daily_summaries as tuple date-rides
x = statRaw.groupby("date")["rides"].sum()
daily_summaries = zip(x.index, x)

# Create a defaultdict of an integer: monthly_total_rides
monthly_total_rides = defaultdict(int)

# Loop over the list daily_summaries
for daily_summary in daily_summaries:
    # Convert the service_date to a datetime object
    service_datetime = datetime.strptime(daily_summary[0], '%m/%d/%Y')
    
    # Add the total rides to the current amount for the month
    monthly_total_rides[service_datetime.month] += int(daily_summary[1])

# Print monthly_total_rides
print(monthly_total_rides)


# Import datetime from the datetime module
from datetime import datetime

# Compute the local datetime: local_dt
local_dt = datetime.now()

# Print the local datetime
print(local_dt)

# Compute the UTC datetime: utc_dt
utc_dt = datetime.utcnow()

# Print the UTC datetime
print(utc_dt)


from pytz import timezone

daily_summaries = [(datetime(2001, 1, 1, 10, 27), '126455'), (datetime(2001, 1, 2, 6, 34), '501952'), (datetime(2001, 1, 3, 22, 17), '536432'), (datetime(2001, 1, 4, 15, 20), '550011'), (datetime(2001, 1, 5, 11, 35), '557917'), (datetime(2001, 1, 6, 1, 33), '255356'), (datetime(2001, 1, 7, 5, 58), '169825'), (datetime(2001, 1, 8, 19, 28), '590706'), (datetime(2001, 1, 9, 13, 55), '599905')]

# Create a Timezone object for Chicago
chicago_usa_tz = timezone('US/Central')

# Create a Timezone object for New York
ny_usa_tz = timezone('US/Eastern')

# Iterate over the daily_summaries list
for orig_dt, ridership in daily_summaries:
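    # Make the orig_dt timezone "aware" for Chicago
    # (note: pytz documents chicago_usa_tz.localize(orig_dt) as the reliable way to do this;
    #  replace(tzinfo=...) can attach a local-mean-time offset with pytz zones)
    chicago_dt = orig_dt.replace(tzinfo=chicago_usa_tz)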
    
    # Convert chicago_dt to the New York Timezone
    ny_dt = chicago_dt.astimezone(ny_usa_tz)
    
    # Print the chicago_dt, ny_dt, and ridership
    print('Chicago: %s, NY: %s, Ridership: %s' % (chicago_dt, ny_dt, ridership))


review_dates = [datetime(2015, 12, 22, 0, 0), datetime(2015, 12, 23, 0, 0), datetime(2015, 12, 24, 0, 0), datetime(2015, 12, 25, 0, 0), datetime(2015, 12, 26, 0, 0), datetime(2015, 12, 27, 0, 0), datetime(2015, 12, 28, 0, 0), datetime(2015, 12, 29, 0, 0), datetime(2015, 12, 30, 0, 0), datetime(2015, 12, 31, 0, 0)]


# Create a daily_summaries that can be used below
statRaw = pd.read_csv(myPath + "CTA_Ridership_Station_Entries_Daily_Totals.csv")
statRaw.head()

# mock up daily_summaries as tuple date-rides
x = statRaw.groupby(["date", "daytype"])["rides"].sum()
daily_summaries = pd.DataFrame( {"day_type":[a[1] for a in x.index], "total_ridership":[a for a in x]} , index=[ datetime.strptime(a[0], '%m/%d/%Y') for a in x.index]).sort_index()
daily_summaries.head()


# Import timedelta from the datetime module
from datetime import timedelta

# Build a timedelta of 30 days: glanceback
glanceback = timedelta(days=30)

# Iterate over the review_dates as date
for date in review_dates:
    # Calculate the date 30 days back: prior_period_dt
    prior_period_dt = date - glanceback
    
    # Print the review_date, day_type and total_ridership
    print('Date: %s, Type: %s, Total Ridership: %s' %
         (date, 
          daily_summaries.loc[date]['day_type'], 
          daily_summaries.loc[date]['total_ridership']))
    
    # Print the prior_period_dt, day_type and total_ridership
    print('Date: %s, Type: %s, Total Ridership: %s' %
         (prior_period_dt, 
          daily_summaries.loc[prior_period_dt]['day_type'], 
          daily_summaries.loc[prior_period_dt]['total_ridership']))


# Iterate over the date_ranges
# for start_date, end_date in date_ranges:
    # Print the End and Start Date
#     print(end_date, start_date)
    # Print the difference between each end and start date
#     print(end_date - start_date)


# Import the pendulum module
import pendulum

# Create a now datetime for Tokyo: tokyo_dt
tokyo_dt = pendulum.now("Asia/Tokyo")

# Convert the tokyo_dt to Los Angeles: la_dt
la_dt = tokyo_dt.in_timezone('America/Los_Angeles')

# Print the ISO 8601 string of la_dt
print(la_dt.to_iso8601_string())


# Iterate over date_ranges
# for start_date, end_date in date_ranges:
#     # Convert the start_date string to a pendulum date: start_dt
#     start_dt = pendulum.parse(start_date)
#     # Convert the end_date string to a pendulum date: end_dt
#     end_dt = pendulum.parse(end_date)
#     # Print the End and Start Date
#     print(end_dt, start_dt)
#     # Calculate the difference between end_dt and start_dt: diff_period
#     diff_period = end_dt - start_dt
#     # Print the difference in days
#     print(diff_period.in_days())
## 2001-02-19 00:00:00
## 2001-04-10 00:00:00
## 2001-05-30 00:00:00
## 2001-07-19 00:00:00
## 2001-09-07 00:00:00
## 2001-10-27 00:00:00
## 2001-12-16 00:00:00
## 2002-02-04 00:00:00
## 2002-03-26 00:00:00
## 2002-05-15 00:00:00
## 2002-07-04 00:00:00
## 2002-08-23 00:00:00
## 2002-10-12 00:00:00
## 2002-12-01 00:00:00
## 2003-01-20 00:00:00
## 2003-03-11 00:00:00
## 2003-04-30 00:00:00
## 2003-06-19 00:00:00
## 2003-08-08 00:00:00
## 2003-09-27 00:00:00
## 2003-11-16 00:00:00
## 2004-01-05 00:00:00
## 2004-02-24 00:00:00
## 2004-04-14 00:00:00
## 2004-06-03 00:00:00
## 2004-07-23 00:00:00
## 2004-09-11 00:00:00
## 2004-10-31 00:00:00
## 2004-12-20 00:00:00
## 2005-02-08 00:00:00
## 2005-03-30 00:00:00
## 2005-05-19 00:00:00
## 2005-07-08 00:00:00
## 2005-08-27 00:00:00
## 2005-10-16 00:00:00
## 2005-12-05 00:00:00
## 2006-01-24 00:00:00
## 2006-03-15 00:00:00
## 2006-05-04 00:00:00
## 2006-06-23 00:00:00
## 2006-08-12 00:00:00
## 2006-10-01 00:00:00
## 2006-11-20 00:00:00
## 2007-01-09 00:00:00
## 2007-02-28 00:00:00
## 2007-04-19 00:00:00
## 2007-06-08 00:00:00
## 2007-07-28 00:00:00
## 2007-09-16 00:00:00
## 2007-11-05 00:00:00
## 2007-12-25 00:00:00
## 2008-02-13 00:00:00
## 2008-04-03 00:00:00
## 2008-05-23 00:00:00
## 2008-07-12 00:00:00
## 2008-08-31 00:00:00
## 2008-10-20 00:00:00
## 2008-12-09 00:00:00
## 2009-01-28 00:00:00
## 2009-03-19 00:00:00
## 2009-05-08 00:00:00
## 2009-06-27 00:00:00
## 2009-08-16 00:00:00
## 2009-10-05 00:00:00
## 2009-11-24 00:00:00
## 2010-01-13 00:00:00
## 2010-03-04 00:00:00
## 2010-04-23 00:00:00
## 2010-06-12 00:00:00
## 2010-08-01 00:00:00
## 2010-09-20 00:00:00
## 2010-11-09 00:00:00
## 2010-12-29 00:00:00
## 2011-02-17 00:00:00
## 2011-04-08 00:00:00
## 2011-05-28 00:00:00
## 2011-07-17 00:00:00
## 2011-09-05 00:00:00
## 2011-10-24 00:00:00
## 2011-11-12 00:00:00
## 2012-01-01 00:00:00
## 2012-02-20 00:00:00
## 2012-04-10 00:00:00
## 2012-05-30 00:00:00
## 2012-07-19 00:00:00
## 2012-09-07 00:00:00
## 2012-10-27 00:00:00
## 2012-12-16 00:00:00
## 2013-02-04 00:00:00
## 2013-03-26 00:00:00
## 2013-05-15 00:00:00
## 2013-07-04 00:00:00
## 2013-08-23 00:00:00
## 2013-10-12 00:00:00
## 2013-12-01 00:00:00
## 2014-01-20 00:00:00
## 2014-03-11 00:00:00
## 2014-04-30 00:00:00
## 2014-06-19 00:00:00
## 2014-08-08 00:00:00
## 2014-09-27 00:00:00
## 2014-11-16 00:00:00
## 2014-07-05 00:00:00
## 2015-01-24 00:00:00
## 2015-03-15 00:00:00
## 2015-05-04 00:00:00
## 2015-06-23 00:00:00
## 2015-08-12 00:00:00
## 2015-10-01 00:00:00
## 2015-11-20 00:00:00
## 2016-01-09 00:00:00
## 2016-02-28 00:00:00
## 2016-04-18 00:00:00
## 2016-06-07 00:00:00
## 2016-07-27 00:00:00
## 2016-09-15 00:00:00
## 2016-11-04 00:00:00
## 02/19/2001
## 2001-02-19T00:00:00
## 04/10/2001
## 2001-04-10T00:00:00
## 05/30/2001
## 2001-05-30T00:00:00
## 07/19/2001
## 2001-07-19T00:00:00
## 09/07/2001
## 2001-09-07T00:00:00
## 10/27/2001
## 2001-10-27T00:00:00
## 12/16/2001
## 2001-12-16T00:00:00
## 02/04/2002
## 2002-02-04T00:00:00
## 03/26/2002
## 2002-03-26T00:00:00
## 05/15/2002
## 2002-05-15T00:00:00
## defaultdict(<class 'int'>, {1: 238267, 2: 609798, 3: 622394, 4: 335950, 5: 619492, 6: 641310, 7: 383347, 8: 640894, 9: 649963, 10: 658584, 11: 631904, 12: 295002})
## 2017-07-31 09:25:50.792207
## 2017-07-31 14:25:50.792207
## Chicago: 2001-01-01 10:27:00-05:51, NY: 2001-01-01 11:18:00-05:00, Ridership: 126455
## Chicago: 2001-01-02 06:34:00-05:51, NY: 2001-01-02 07:25:00-05:00, Ridership: 501952
## Chicago: 2001-01-03 22:17:00-05:51, NY: 2001-01-03 23:08:00-05:00, Ridership: 536432
## Chicago: 2001-01-04 15:20:00-05:51, NY: 2001-01-04 16:11:00-05:00, Ridership: 550011
## Chicago: 2001-01-05 11:35:00-05:51, NY: 2001-01-05 12:26:00-05:00, Ridership: 557917
## Chicago: 2001-01-06 01:33:00-05:51, NY: 2001-01-06 02:24:00-05:00, Ridership: 255356
## Chicago: 2001-01-07 05:58:00-05:51, NY: 2001-01-07 06:49:00-05:00, Ridership: 169825
## Chicago: 2001-01-08 19:28:00-05:51, NY: 2001-01-08 20:19:00-05:00, Ridership: 590706
## Chicago: 2001-01-09 13:55:00-05:51, NY: 2001-01-09 14:46:00-05:00, Ridership: 599905
## Date: 2015-12-22 00:00:00, Type: W, Total Ridership: 547458
## Date: 2015-11-22 00:00:00, Type: U, Total Ridership: 276222
## Date: 2015-12-23 00:00:00, Type: W, Total Ridership: 471055
## Date: 2015-11-23 00:00:00, Type: W, Total Ridership: 642924
## Date: 2015-12-24 00:00:00, Type: W, Total Ridership: 312039
## Date: 2015-11-24 00:00:00, Type: W, Total Ridership: 662887
## Date: 2015-12-25 00:00:00, Type: U, Total Ridership: 133225
## Date: 2015-11-25 00:00:00, Type: W, Total Ridership: 549277
## Date: 2015-12-26 00:00:00, Type: A, Total Ridership: 239119
## Date: 2015-11-26 00:00:00, Type: U, Total Ridership: 191233
## Date: 2015-12-27 00:00:00, Type: U, Total Ridership: 223687
## Date: 2015-11-27 00:00:00, Type: W, Total Ridership: 337460
## Date: 2015-12-28 00:00:00, Type: W, Total Ridership: 399002
## Date: 2015-11-28 00:00:00, Type: A, Total Ridership: 322238
## Date: 2015-12-29 00:00:00, Type: W, Total Ridership: 470650
## Date: 2015-11-29 00:00:00, Type: U, Total Ridership: 255475
## Date: 2015-12-30 00:00:00, Type: W, Total Ridership: 482195
## Date: 2015-11-30 00:00:00, Type: W, Total Ridership: 622425
## Date: 2015-12-31 00:00:00, Type: W, Total Ridership: 466078
## Date: 2015-12-01 00:00:00, Type: W, Total Ridership: 654723
## 2017-07-31T07:25:51-07:00
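
The 30-day look-back above indexes daily_summaries with .loc, which raises a KeyError when a date is absent from the index. A minimal sketch of the same timedelta arithmetic with a tolerant lookup, assuming a plain dictionary instead of the DataFrame (daily_totals is a hypothetical stand-in that holds two values copied from the output above):

```python
from datetime import datetime, timedelta

# Hypothetical stand-in for daily_summaries: date -> (day_type, total_ridership)
daily_totals = {
    datetime(2015, 11, 22): ("U", 276222),
    datetime(2015, 12, 22): ("W", 547458),
}

# Build the 30-day look-back window
glanceback = timedelta(days=30)

review_date = datetime(2015, 12, 22)
prior_date = review_date - glanceback

# dict.get returns None (or a chosen default) rather than raising KeyError
current = daily_totals.get(review_date)
prior = daily_totals.get(prior_date)
missing = daily_totals.get(review_date - timedelta(days=365))

print("Date: %s, Summary: %s" % (review_date, current))
print("Date: %s, Summary: %s" % (prior_date, prior))
print("Date one year back found: %s" % (missing is not None))
```

With a pandas index, a comparable guard would be to test membership (date in daily_summaries.index) before calling .loc.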

Chapter 5 - Answering Data Science Questions

Counting within Date Ranges - data set is crime data for Chicago:

  • Can access the full database through the Chicago OpenData portal
  • Step 1 - read data from CSV, store in a list
  • Step 2 - use Counter to get counts
  • Step 3 - group data into a dictionary that is keyed by month - defaultdict
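
The three steps can be sketched on a tiny in-memory CSV. This is an illustration only: the three rows and the reduced column set are hypothetical, while the date format matches the one used by the Chicago crime extract in these notes.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical three-row extract (the real file has many more columns and rows)
raw = io.StringIO(
    "Date,Primary Type,Location Description\n"
    "05/19/2015 01:12:00 AM,ASSAULT,APARTMENT\n"
    "05/21/2015 06:00:00 PM,THEFT,STREET\n"
    "06/24/2015 06:00:00 AM,NARCOTICS,RESIDENCE\n"
)
date_format = "%m/%d/%Y %I:%M:%S %p"

# Step 1 - read the CSV into a list of tuples, skipping the header row
reader = csv.reader(raw)
next(reader)
crime_data = [tuple(row) for row in reader]

# Step 2 - count crimes per month with a Counter
crimes_by_month = Counter(
    datetime.strptime(row[0], date_format).month for row in crime_data
)

# Step 3 - group locations by month; defaultdict(list) creates each list on first use
locations_by_month = defaultdict(list)
for date_str, _crime_type, location in crime_data:
    month = datetime.strptime(date_str, date_format).month
    locations_by_month[month].append(location)

print(crimes_by_month.most_common(1))
print(dict(locations_by_month))
```

A Counter suits the case where only the tallies matter, while defaultdict(list) keeps the underlying values available for a second pass, such as counting locations within each month.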

Dictionaries with Time Windows for Keys - crimes by district and differences by block:

  • Step 1 - read CSV data as dictionary using csv.DictReader() ; pop out the key and store the remaining dictionary
  • Step 2 - Pythonically loop over the dictionary using .items()
  • Step 3 - use sets to get the unique crime types per block, and set differences to compare blocks
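
A minimal sketch of these three steps, assuming a hypothetical four-row file with only two columns. The block names match the ones used later in these notes, but the rows themselves are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical extract: real rows carry many more fields
raw = io.StringIO(
    "Block,Primary Type\n"
    "001XX N STATE ST,THEFT\n"
    "001XX N STATE ST,LIQUOR LAW VIOLATION\n"
    "0000X W TERMINAL ST,THEFT\n"
    "0000X W TERMINAL ST,STALKING\n"
)

# Step 1 - DictReader yields one dictionary per row; pop the key column
# and store a value from the remaining dictionary under that key
crimes_by_block = defaultdict(list)
for row in csv.DictReader(raw):
    block = row.pop("Block")
    crimes_by_block[block].append(row["Primary Type"])

# Step 2 - .items() yields (key, value) pairs for tuple unpacking
for block, crimes in crimes_by_block.items():
    print(block, len(crimes))

# Step 3 - sets drop duplicates; difference keeps what is in one set only
state_st = set(crimes_by_block["001XX N STATE ST"])
terminal_st = set(crimes_by_block["0000X W TERMINAL ST"])
only_state = state_st.difference(terminal_st)
only_terminal = terminal_st - state_st  # the operator form of .difference()

print(only_state)
print(only_terminal)
```

Set difference is not symmetric, which is why both directions are computed.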

Final thoughts - learned the fundamentals of data types.

Example code includes:


myPath = "./PythonInputFiles/"



# Downloaded 2015 crime data for districts 001, 016, and 019 from
# https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9
# File is in myPath + "Chicago_Crime_2015_001_016_019.csv"


# Import the csv module
import csv

# Create the file object: csvfile
csvfile = open(myPath + "Chicago_Crime_2015_001_016_019.csv", "r")

# Create an empty list: crime_data
crime_data = []

# Loop over a csv reader on the file object
for row in csv.reader(csvfile):
    # Append the date, type of crime, location description, and arrest
    crime_data.append((row[2], row[5], row[7], row[8]))
    # crime_data.append((row[0], row[2], row[4], row[5]))

# Remove the first element from crime_data (the header row)
crime_data.pop(0)

# Print the first 10 records
print(crime_data[:10])


# Import necessary modules
from collections import Counter
from datetime import datetime

# Create a Counter Object: crimes_by_month
crimes_by_month = Counter()

# Loop over the crime_data list
for x in crime_data:
    # Convert the first element of each item into a Python Datetime Object: date
    date = datetime.strptime(x[0], '%m/%d/%Y %I:%M:%S %p')
    
    # Increment the counter for the month of the row by one
    crimes_by_month[date.month] += 1

# Print the 3 most common months for crime
print(crimes_by_month.most_common(3))


# Import necessary modules
from collections import defaultdict
from datetime import datetime

# Create a dictionary that defaults to a list: locations_by_month
locations_by_month = defaultdict(list)

# Loop over the crime_data list
for row in crime_data:
    # Convert the first element to a date object
    date = datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p')
    
    # If the year is 2015 (all I have in this data)
    if date.year == 2015:
        # Set the dictionary key to the month and add the location (third element) to the values list
        locations_by_month[date.month].append(row[2])

# Print the dictionary
# print(locations_by_month)  # WAY too long!


# Import Counter from collections
from collections import Counter

# Loop over the items from locations_by_month using tuple expansion of the month and locations
for month, locations in locations_by_month.items():
    # Make a Counter of the locations
    location_count = Counter(locations)
    # Print the month 
    print(month)
    # Print the five most common locations
    print(location_count.most_common(5))


# Re-open the CSV file: csvfile
csvfile = open(myPath + "Chicago_Crime_2015_001_016_019.csv", "r")

# Create a dictionary that defaults to a list: crimes_by_district
crimes_by_district = defaultdict(list)

# Loop over a DictReader of the CSV file
for row in csv.DictReader(csvfile):
    # Pop the district from each row: district
    district = row.pop("District")
    # Append the rest of the data to the list for proper district in crimes_by_district
    crimes_by_district[district].append(row)


# Loop over the crimes_by_district using expansion as district and crimes
for district, crimes in crimes_by_district.items():
    # Print the district
    print(district)
    
    # Create an empty Counter object: year_count
    year_count = Counter()
    
    # Loop over the crimes:
    for crime in crimes:
        # If there was an arrest
        if crime['Arrest'] == 'true':
            # Convert the Date to a datetime and get the year
            year = datetime.strptime(crime["Date"], '%m/%d/%Y %I:%M:%S %p').year
            # Increment the Counter for the year
            year_count[year] += 1
    
    # Print the counter
    print(year_count)
    

# Create crimes_by_block as a dictionary that defaults to a list
crimes_by_block = defaultdict(list)

# Re-open the CSV file: csvfile
csvfile = open(myPath + "Chicago_Crime_2015_001_016_019.csv", "r")

# Loop over a DictReader of the CSV file
for row in csv.DictReader(csvfile):
    block = row.pop("Block")
    crimeType = row.pop("Primary Type")
    crimes_by_block[block].append(crimeType)


# Create a set of the unique crimes for the first block: n_state_st_crimes
n_state_st_crimes = set(crimes_by_block['001XX N STATE ST'])

# Print the set
print(n_state_st_crimes)

# Create a set of the unique crimes for the second block: w_terminal_st_crimes
w_terminal_st_crimes = set(crimes_by_block['0000X W TERMINAL ST'])

# Print the set
print(w_terminal_st_crimes)

# Print the crimes found in one block but not the other, in each direction
print(n_state_st_crimes.difference(w_terminal_st_crimes))
print(w_terminal_st_crimes.difference(n_state_st_crimes))
## [('05/19/2015 01:12:00 AM', 'ASSAULT', 'APARTMENT', 'true'), ('06/24/2015 06:00:00 AM', 'NARCOTICS', 'RESIDENCE', 'true'), ('07/10/2015 06:00:00 AM', 'NARCOTICS', 'GOVERNMENT BUILDING/PROPERTY', 'true'), ('08/21/2015 02:26:00 PM', 'NARCOTICS', 'PARKING LOT/GARAGE(NON.RESID.)', 'true'), ('03/19/2015 08:05:00 PM', 'NARCOTICS', 'AIRPORT/AIRCRAFT', 'true'), ('03/26/2015 09:45:00 AM', 'NARCOTICS', 'AIRPORT/AIRCRAFT', 'true'), ('04/17/2015 10:44:00 AM', 'NARCOTICS', 'SIDEWALK', 'true'), ('09/08/2015 06:00:00 AM', 'NARCOTICS', 'GOVERNMENT BUILDING/PROPERTY', 'true'), ('05/11/2015 06:30:00 PM', 'NARCOTICS', 'AIRPORT/AIRCRAFT', 'true'), ('03/01/2015 09:00:00 AM', 'OTHER OFFENSE', 'OTHER', 'false')]
## [(8, 3187), (7, 3090), (10, 2969)]
## 5
## [('STREET', 470), ('RESIDENCE', 284), ('APARTMENT', 193), ('OTHER', 189), ('SIDEWALK', 184)]
## 6
## [('STREET', 574), ('RESIDENCE', 316), ('SIDEWALK', 276), ('APARTMENT', 209), ('OTHER', 188)]
## 7
## [('STREET', 616), ('RESIDENCE', 313), ('SIDEWALK', 280), ('OTHER', 236), ('APARTMENT', 186)]
## 8
## [('STREET', 618), ('RESIDENCE', 331), ('SIDEWALK', 282), ('APARTMENT', 199), ('OTHER', 186)]
## 3
## [('STREET', 475), ('RESIDENCE', 297), ('APARTMENT', 204), ('OTHER', 177), ('SIDEWALK', 172)]
## 4
## [('STREET', 438), ('RESIDENCE', 345), ('APARTMENT', 198), ('OTHER', 181), ('SIDEWALK', 161)]
## 9
## [('STREET', 514), ('RESIDENCE', 295), ('SIDEWALK', 276), ('OTHER', 210), ('APARTMENT', 187)]
## 11
## [('STREET', 482), ('RESIDENCE', 260), ('APARTMENT', 212), ('OTHER', 200), ('RESTAURANT', 157)]
## 12
## [('STREET', 547), ('RESIDENCE', 364), ('APARTMENT', 232), ('OTHER', 188), ('RESTAURANT', 162)]
## 1
## [('STREET', 416), ('RESIDENCE', 345), ('OTHER', 191), ('APARTMENT', 187), ('RESTAURANT', 125)]
## 2
## [('STREET', 317), ('RESIDENCE', 271), ('APARTMENT', 165), ('OTHER', 153), ('PARKING LOT/GARAGE(NON.RESID.)', 86)]
## 10
## [('STREET', 534), ('RESIDENCE', 300), ('SIDEWALK', 226), ('OTHER', 224), ('APARTMENT', 219)]
## 019
## Counter({2015: 2122})
## 016
## Counter({2015: 1853})
## 001
## Counter({2015: 2788})
## {'DECEPTIVE PRACTICE', 'ROBBERY', 'LIQUOR LAW VIOLATION', 'NARCOTICS', 'ASSAULT', 'THEFT', 'OTHER OFFENSE', 'PUBLIC PEACE VIOLATION', 'CRIMINAL TRESPASS', 'BATTERY', 'BURGLARY', 'CRIMINAL DAMAGE'}
## {'DECEPTIVE PRACTICE', 'NON-CRIMINAL', 'MOTOR VEHICLE THEFT', 'ROBBERY', 'NARCOTICS', 'ASSAULT', 'THEFT', 'SEX OFFENSE', 'OFFENSE INVOLVING CHILDREN', 'WEAPONS VIOLATION', 'OTHER OFFENSE', 'PUBLIC PEACE VIOLATION', 'CRIMINAL TRESPASS', 'BATTERY', 'BURGLARY', 'STALKING', 'CRIMINAL DAMAGE'}
## {'LIQUOR LAW VIOLATION'}
## {'NON-CRIMINAL', 'MOTOR VEHICLE THEFT', 'SEX OFFENSE', 'OFFENSE INVOLVING CHILDREN', 'WEAPONS VIOLATION', 'STALKING'}

Additional Exploration - CTA

Some additional experimentation with the CTA data, including:

  • Trend in average daily rides by month
  • Average daily rides by daytype
  • Top-20 stations (average daily riders)
  • Average rides by daytype by station
  • Percentage of full-week average by daytype by station
  • Greatest consistency and inconsistency by station and day-type
  • Greatest seasonality by station
  • Patterns by day of week (weekdays only)
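
The seasonality comparison in the list above can be sketched without pandas. This is a minimal illustration, assuming two hypothetical stations (Steady and Seasonal) with invented ride totals over three months: each station's monthly share of its own rides is compared with the system-wide monthly share, and the gaps are combined as the square root of the summed squared differences.

```python
from math import sqrt

# Hypothetical ride totals per station for three months (invented numbers)
rides = {
    "Steady": [300, 300, 300],
    "Seasonal": [10, 20, 70],
}
n_months = 3

# Benchmark: share of all system rides that falls in each month
month_totals = [sum(r[m] for r in rides.values()) for m in range(n_months)]
grand_total = sum(month_totals)
bench_pct = [t / grand_total for t in month_totals]

# Deviation: distance between a station's monthly shares and the benchmark
deviation = {}
for station, monthly in rides.items():
    station_total = sum(monthly)
    shares = [v / station_total for v in monthly]
    deviation[station] = sqrt(sum((s - b) ** 2 for s, b in zip(shares, bench_pct)))

# Rank stations from most to least seasonal
ranked = sorted(deviation, key=deviation.get, reverse=True)
print(ranked)
```

Dividing the summed squares by the number of months before taking the root would turn this into a root-mean-square figure. The ranking of stations would be unchanged, since every station is divided by the same constant.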

Example code includes:


myPath = "./PythonInputFiles/"



# Create stations data from the CSV downloaded from Chicago Open Data
# https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Daily-Totals/5neh-572f/data
# Filtered the data to download only 2015-2016
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt


statRaw = pd.read_csv(myPath + "CTA_Ridership_Station_Entries_Daily_Totals.csv")
statRaw["convDate"] = [datetime.strptime(x, "%m/%d/%Y") for x in statRaw["date"]]
statRaw.head()


# Average daily rides by month
dailyRides = statRaw[["convDate", "rides"]].groupby("convDate").sum()
avgMonthlyRides = dailyRides.resample("M").mean()
print(round(avgMonthlyRides, 0))
avgMonthlyRides.plot()
plt.ylim([0, round(max(avgMonthlyRides["rides"]), -5) + 50000])
plt.title("Average Daily Rides by Month (CTA)")
plt.xlabel("")
plt.ylabel("Average Daily Rides")
# plt.show()
plt.savefig("_dummyPy093.png", bbox_inches="tight")
plt.clf()


# Plot 2015 and 2016 against the same month axis for comparison
convMonthlyRides = avgMonthlyRides.copy()
convMonthlyRides["year"] = convMonthlyRides.index.year
convMonthlyRides["month"] = convMonthlyRides.index.month
convMonthlyRides = convMonthlyRides.pivot_table(index="month", values="rides", columns="year", aggfunc=sum)
convMonthlyRides.plot()
plt.ylim([0, round(max(avgMonthlyRides["rides"]), -5) + 50000])
plt.title("Average Daily Rides by Month (CTA)")
plt.xlabel("Month")
plt.ylabel("Average Daily Rides")
# plt.show()
plt.savefig("_dummyPy094.png", bbox_inches="tight")
plt.clf()


# Average daily rides by daytype
typeRides = statRaw[["daytype", "convDate", "rides"]].groupby(["convDate", "daytype"]).sum()
print(round(typeRides.groupby("daytype").mean(), 0))
typeRides.groupby("daytype").mean().plot(kind="bar")
plt.title("Average Daily Rides by Day Type 2015-2016 (CTA)")
plt.xlabel("Day Type (A=Sat, U=Sun/Hol, W=Weekday)")
plt.ylabel("Average Daily Rides")
# plt.show()
plt.savefig("_dummyPy095.png", bbox_inches="tight")
plt.clf()


# Average daily rides by station
stationRides = statRaw[["stationname", "rides"]].groupby(["stationname"]).mean().sort_values("rides", ascending=False)
print(round(stationRides.iloc[:20, :], 0))
print(round(stationRides.iloc[-20:, :], 0))
stationRides.plot(kind="bar")
plt.title("Average Daily Rides by Station 2015-2016 (CTA)")
plt.xticks([])
plt.ylim([0, round(max(stationRides["rides"]), -4) + 5000])
plt.xlabel("Stations Sorted by Descending Rides")
plt.ylabel("Average Daily Rides")
# plt.show()
plt.savefig("_dummyPy096.png", bbox_inches="tight")
plt.clf()


import numpy as np

# Average daily rides by daytype by station
daytypeRides = statRaw.pivot_table(index="stationname", values="rides", columns="daytype", aggfunc=np.mean)
print(round(daytypeRides.loc[stationRides.iloc[:20, :].index, :], 0))
print(round(daytypeRides.loc[stationRides.iloc[-20:, :].index, :], 0))


# Deviation from average by daytype
daytypeRides["totMean"] = stationRides.loc[daytypeRides.index, "rides"]
ratA = daytypeRides["A"] / daytypeRides["totMean"]
ratU = daytypeRides["U"] / daytypeRides["totMean"]
ratW = daytypeRides["W"] / daytypeRides["totMean"]

print(round(ratA.sort_values(ascending=False)[0:10], 3))
print(round(ratU.sort_values(ascending=False)[0:10], 3))
print(round(ratW.sort_values(ascending=False)[0:10], 3))
print(round(ratA.sort_values(ascending=False)[-10:], 3))
print(round(ratU.sort_values(ascending=False)[-10:], 3))

ratW.sort_values(ascending=False).plot()
ratA.sort_values(ascending=False).plot()
ratU.sort_values(ascending=False).plot()
plt.ylim([0, 1.5])
plt.xticks([])
plt.title("Percentage of Average Daily Rides by Day Type")
plt.xlabel("Station - Sorted Independently for Each Day Type")
plt.ylabel("% of Daily Average Rides on Day Type")
plt.legend(["W (Weekday)", "A (Saturday)", "U (Sun/Hol)"])
# plt.show()
plt.savefig("_dummyPy097.png", bbox_inches="tight")
plt.clf()


# Greatest consistency and inconsistency by station and daytype
statDayType = pd.DataFrame( {"ratW":ratW, "ratA":ratA, "ratU":ratU} )[["ratW", "ratA", "ratU"]]
statDayType["STD"] = statDayType[["ratW", "ratA", "ratU"]].apply(np.std, axis=1)
print(round(statDayType.sort_values("STD", ascending=False).iloc[:20, :], 3))
print(round(statDayType.sort_values("STD", ascending=True).iloc[:20, :], 3))
statDayType.sort_values("STD", ascending=False).plot()
plt.xticks([])
plt.xlabel("Station - Sorted by Increasing Consistency by Day Type")
plt.legend(["Weekday", "Sat", "Sun/Hol", "Deviation"])
# plt.show()
plt.savefig("_dummyPy098.png", bbox_inches="tight")
plt.clf()


# statAU = statDayType[["ratA", "ratU"]]
# statAU["Delta"] = (statAU["ratA"] - statAU["ratU"]) / (statAU["ratA"] + statAU["ratU"])
# print(round(statAU.sort_values("Delta", ascending=False).iloc[:20, :], 3))
# print(round(statAU.sort_values("Delta", ascending=True).iloc[:20, :], 3))
# statAU.sort_values("Delta", ascending=False).plot()
# plt.xticks([])
# plt.show()


# Greatest seasonality by station
# Use month as a surrogate for season, and compare percent by month to system totals
statMonth = [x.month for x in statRaw["convDate"]]
# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
miniStation = statRaw[["stationname", "rides"]].copy()
miniStation["month"] = statMonth
miniPivot = miniStation.pivot_table(index="stationname", values="rides", columns="month", aggfunc=sum)

miniColSum = miniPivot.apply(sum, axis=0)
miniRowSum = miniPivot.apply(sum, axis=1)
benchPct = miniColSum / sum(miniColSum)

miniPct = miniPivot.copy()
for x in miniPct.columns:
    miniPct[x] = miniPct[x] / miniRowSum

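# Root of the summed squared gaps between each station's monthly shares and the system shares
# (a Euclidean distance; the plots below label it RMSE)
miniDev = [sum((miniPct.loc[x, :] - benchPct) ** 2) ** 0.5 for x in miniPct.index]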
miniPct["Deviation"] = miniDev
topDev = miniPct.sort_values("Deviation", ascending=False)
del miniPct["Deviation"]
topDev.loc[:, "Deviation"].plot()
plt.xticks([])
plt.title("Station Seasonality vs. System Seasonality (RMSE)")
plt.xlabel("Station")
plt.ylabel("RMSE")
# plt.show()
plt.savefig("_dummyPy099.png", bbox_inches="tight")
plt.clf()


print(topDev.iloc[0:20, :])

benchPct.plot()
plt.ylim([0.025, 0.175])

# Skip the station that closed mid-year (Madison/Wabash) - use index 1, 2, 3, 4, 5 only
for a in topDev.index[1:6]:
    miniPct.loc[a, :].plot()

plt.legend(["System Average", topDev.index[1], topDev.index[2], topDev.index[3], topDev.index[4], topDev.index[5]], loc="upper center")
plt.title("Stations with Greatest Seasonality vs. System (RMSE)")
plt.xlabel("Month")
plt.ylabel("% of Annual Rides in Month")
# plt.show()
plt.savefig("_dummyPy100.png", bbox_inches="tight")
plt.clf()



# Patterns by day of week
# Break weekday into M/Tu/We/Th/F and eliminate weekday holidays
testStation = statRaw.copy()
testStation["weekday"] = [x.weekday() for x in testStation["convDate"]]
testStation["weekday"].value_counts()
testStation.groupby(["daytype", "weekday"]).count()
myBool = (testStation["daytype"] != "U") | (testStation["weekday"] == 6)
useStation = testStation.loc[myBool, :]
print(useStation.groupby(["daytype", "weekday"]).count())

a = useStation[["weekday", "rides", "convDate"]].groupby(["convDate", "weekday"]).sum().groupby("weekday").mean()
print(a)
a.plot(kind="bar")
plt.xlabel("")
plt.ylabel("Average Rides per Day")
plt.title("Average Rides per Day by Day of Week (CTA 2015-2016)")
plt.xticks(np.arange(7), ["Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"], rotation=0)
# plt.show()
plt.savefig("_dummyPy101.png", bbox_inches="tight")
plt.clf()


workDay = useStation.loc[useStation["daytype"] == "W", :].pivot_table(index="stationname", values="rides", columns="weekday", aggfunc=np.mean)
workDay["STD"] = [np.sqrt(sum( (workDay.loc[b, :] / sum(workDay.loc[b, :]) - 0.2) ** 2 )) for b in workDay.index]
workDay.sort_values("STD", ascending=False)["STD"].plot(kind="bar")
plt.xticks([])
plt.xlabel("Stations sorted from Least to Most Consistent")
plt.ylabel("Inconsistency (RMSE)")
plt.title("Consistency by Workday and Station (CTA 2015-2016)")
# plt.show()
plt.savefig("_dummyPy102.png", bbox_inches="tight")
plt.clf()


print(workDay.sort_values("STD", ascending=False).iloc[0:6, :])

(workDay.iloc[:, 0:5].apply(sum, axis=0) / sum(workDay.iloc[:, 0:5].apply(sum, axis=0))).plot()
for c in range(4):
    d = workDay.sort_values("STD", ascending=False).iloc[c, 0:5]
    (d / sum(d)).plot()

plt.xticks(np.arange(5), ["Mon", "Tues", "Wed", "Thurs", "Fri"])
plt.ylim([0.15, 0.25])
plt.xlabel("")
plt.ylabel("Proportion of Workday Rides")
plt.title("Outlier Stations for Workday Ride Patterns (CTA 2015-2016)")
plt.legend()
# plt.show()
plt.savefig("_dummyPy103.png", bbox_inches="tight")
plt.clf()
##                rides
## convDate            
## 2015-01-31  474145.0
## 2015-02-28  503489.0
## 2015-03-31  530615.0
## 2015-04-30  546817.0
## 2015-05-31  532022.0
## 2015-06-30  575190.0
## 2015-07-31  579919.0
## 2015-08-31  549998.0
## 2015-09-30  592430.0
## 2015-10-31  601043.0
## 2015-11-30  532817.0
## 2015-12-31  491164.0
## 2016-01-31  478600.0
## 2016-02-29  524561.0
## 2016-03-31  537988.0
## 2016-04-30  538034.0
## 2016-05-31  535405.0
## 2016-06-30  569531.0
## 2016-07-31  542375.0
## 2016-08-31  544725.0
## 2016-09-30  569900.0
## 2016-10-31  576956.0
## 2016-11-30  544092.0
## 2016-12-31  451587.0
##             rides
## daytype          
## A        381327.0
## U        288387.0
## W        627657.0
##                        rides
## stationname                 
## Lake/State           19186.0
## Clark/Lake           16374.0
## Chicago/State        14399.0
## Grand/State          11836.0
## Belmont-North Main   11830.0
## Fullerton            11399.0
## O'Hare Airport       11015.0
## Roosevelt            10427.0
## Washington/Dearborn  10176.0
## 95th/Dan Ryan         9774.0
## Monroe/State          9495.0
## Jackson/State         9147.0
## State/Lake            8813.0
## Addison-North Main    8477.0
## Randolph/Wabash       8185.0
## Midway Airport        7739.0
## Adams/Wabash          7656.0
## Clark/Division        7270.0
## 79th                  6664.0
## Jackson/Dearborn      6459.0
##                      rides
## stationname               
## Pulaski-Cermak      1068.0
## Western-Cermak      1057.0
## Harlem-Forest Park  1049.0
## Kedzie-Cermak       1003.0
## California-Lake      968.0
## 51st                 947.0
## 43rd                 941.0
## Linden               879.0
## Conservatory         865.0
## Dempster             800.0
## Foster               784.0
## Indiana              778.0
## Noyes                732.0
## Central-Evanston     715.0
## South Boulevard      683.0
## Halsted/63rd         628.0
## Oakton-Skokie        583.0
## King Drive           555.0
## Madison/Wabash       540.0
## Kostner              462.0
## daytype                    A        U        W
## stationname                                   
## Lake/State           14223.0  10200.0  22252.0
## Clark/Lake            6846.0   5466.0  20817.0
## Chicago/State        13201.0   9736.0  15706.0
## Grand/State          12120.0   9363.0  12340.0
## Belmont-North Main   10877.0   8334.0  12821.0
## Fullerton             9079.0   6611.0  12965.0
## O'Hare Airport        9443.0  10300.0  11501.0
## Roosevelt             9586.0   7664.0  11229.0
## Washington/Dearborn   6398.0   4699.0  12199.0
## 95th/Dan Ryan         7023.0   5529.0  11306.0
## Monroe/State          5298.0   3842.0  11645.0
## Jackson/State         4968.0   3714.0  11243.0
## State/Lake            6201.0   4463.0  10341.0
## Addison-North Main    8889.0   7147.0   8695.0
## Randolph/Wabash       5116.0   3521.0   9877.0
## Midway Airport        4827.0   4195.0   9145.0
## Adams/Wabash          4462.0   3298.0   9305.0
## Clark/Division        6667.0   5185.0   7869.0
## 79th                  5298.0   4260.0   7491.0
## Jackson/Dearborn      3387.0   2645.0   7958.0
## daytype                 A      U       W
## stationname                             
## Pulaski-Cermak      804.0  613.0  1226.0
## Western-Cermak      789.0  597.0  1217.0
## Harlem-Forest Park  709.0  505.0  1243.0
## Kedzie-Cermak       780.0  589.0  1144.0
## California-Lake     656.0  522.0  1133.0
## 51st                744.0  542.0  1081.0
## 43rd                649.0  503.0  1100.0
## Linden              749.0  552.0   979.0
## Conservatory        686.0  533.0   977.0
## Dempster            728.0  557.0   870.0
## Foster              592.0  429.0   905.0
## Indiana             513.0  436.0   911.0
## Noyes               537.0  365.0   855.0
## Central-Evanston    650.0  323.0   818.0
## South Boulevard     457.0  327.0   811.0
## Halsted/63rd        433.0  325.0   737.0
## Oakton-Skokie       344.0  238.0   711.0
## King Drive          430.0  344.0   629.0
## Madison/Wabash      415.0  202.0   643.0
## Kostner             309.0  242.0   543.0
## stationname
## Cermak-Chinatown          1.093
## Addison-North Main        1.049
## Grand/State               1.024
## North/Clybourn            0.928
## Belmont-North Main        0.919
## Roosevelt                 0.919
## Clark/Division            0.917
## Chicago/State             0.917
## Cermak-McCormick Place    0.910
## Dempster                  0.910
## dtype: float64
## stationname
## O'Hare Airport         0.935
## Cermak-Chinatown       0.881
## Addison-North Main     0.843
## Grand/State            0.791
## Roosevelt              0.735
## Clark/Division         0.713
## Pulaski-Forest Park    0.708
## Belmont-North Main     0.704
## Dempster               0.696
## North/Clybourn         0.695
## dtype: float64
## stationname
## LaSalle/Van Buren    1.341
## Washington/Wells     1.338
## Merchandise Mart     1.329
## Quincy/Wells         1.322
## Polk                 1.304
## Chicago/Franklin     1.286
## Medical Center       1.275
## Monroe/Dearborn      1.272
## Clinton-Lake         1.272
## Clark/Lake           1.271
## dtype: float64
## stationname
## Medical Center       0.433
## Chicago/Franklin     0.431
## Clinton-Lake         0.431
## Clark/Lake           0.418
## Monroe/Dearborn      0.413
## Polk                 0.350
## Merchandise Mart     0.300
## Quincy/Wells         0.282
## Washington/Wells     0.259
## LaSalle/Van Buren    0.249
## dtype: float64
## stationname
## Monroe/Dearborn      0.333
## UIC-Halsted          0.329
## Clinton-Lake         0.317
## Medical Center       0.304
## Chicago/Franklin     0.256
## Polk                 0.250
## Quincy/Wells         0.234
## Merchandise Mart     0.188
## Washington/Wells     0.186
## LaSalle/Van Buren    0.180
## dtype: float64
##                        ratW   ratA   ratU    STD
## stationname                                     
## LaSalle/Van Buren     1.341  0.249  0.180  0.532
## Washington/Wells      1.338  0.259  0.186  0.527
## Merchandise Mart      1.329  0.300  0.188  0.513
## Quincy/Wells          1.322  0.282  0.234  0.502
## Polk                  1.304  0.350  0.250  0.475
## Chicago/Franklin      1.286  0.431  0.256  0.450
## Medical Center        1.275  0.433  0.304  0.431
## Clinton-Lake          1.272  0.431  0.317  0.426
## Monroe/Dearborn       1.272  0.413  0.333  0.425
## Clark/Lake            1.271  0.418  0.334  0.423
## UIC-Halsted           1.266  0.449  0.329  0.416
## Ridgeland             1.253  0.509  0.334  0.398
## Oak Park-Forest Park  1.247  0.502  0.367  0.387
## Clinton-Forest Park   1.234  0.478  0.443  0.365
## Jackson/Dearborn      1.232  0.524  0.410  0.364
## Racine                1.230  0.540  0.406  0.361
## Jackson/State         1.229  0.543  0.406  0.360
## Wellington            1.226  0.573  0.391  0.359
## Pulaski-Orange        1.226  0.563  0.400  0.357
## Armitage              1.223  0.600  0.381  0.357
##                          ratW   ratA   ratU    STD
## stationname                                       
## O'Hare Airport          1.044  0.857  0.935  0.077
## Cermak-Chinatown        1.008  1.093  0.881  0.087
## Addison-North Main      1.026  1.049  0.843  0.092
## Grand/State             1.043  1.024  0.791  0.114
## Roosevelt               1.077  0.919  0.735  0.140
## Clark/Division          1.082  0.917  0.713  0.151
## Belmont-North Main      1.084  0.919  0.704  0.155
## North/Clybourn          1.084  0.928  0.695  0.160
## Dempster                1.088  0.910  0.696  0.160
## Pulaski-Forest Park     1.097  0.851  0.708  0.160
## Laramie                 1.094  0.889  0.687  0.166
## Chicago/State           1.091  0.917  0.676  0.170
## Jarvis                  1.101  0.869  0.675  0.174
## Cermak-McCormick Place  1.096  0.910  0.660  0.179
## Argyle                  1.106  0.849  0.670  0.179
## Sox-35th-Dan Ryan       1.107  0.842  0.671  0.179
## Morse                   1.109  0.839  0.668  0.181
## Harrison                1.099  0.898  0.655  0.182
## Lawrence                1.108  0.848  0.660  0.184
## Granville               1.104  0.891  0.642  0.189
## month                          1         2         3         4         5  \
## stationname                                                                
## Madison/Wabash          0.398988  0.385555  0.214910  0.000000  0.000000   
## Oakton-Skokie           0.096158  0.097952  0.111193  0.111125  0.085027   
## Dempster-Skokie         0.094221  0.092433  0.106116  0.110728  0.091168   
## UIC-Halsted             0.078581  0.093692  0.091518  0.096727  0.056781   
## Addison-North Main      0.055724  0.053556  0.064407  0.083002  0.099099   
## Cermak-McCormick Place  0.032409  0.062367  0.079548  0.080936  0.088567   
## Linden                  0.061200  0.058489  0.067087  0.072109  0.091219   
## Fullerton               0.085145  0.082689  0.086284  0.092141  0.093940   
## California-Cermak       0.073007  0.071432  0.080919  0.078071  0.080125   
## Sox-35th-Dan Ryan       0.067474  0.066771  0.076469  0.085913  0.094846   
## Racine                  0.082383  0.081751  0.095961  0.082130  0.090382   
## Laramie                 0.076938  0.073433  0.082686  0.098942  0.101857   
## Harrison                0.067872  0.078004  0.084941  0.089019  0.085421   
## Central-Evanston        0.072770  0.070955  0.076965  0.074477  0.075629   
## Grand/State             0.069566  0.064770  0.079480  0.079286  0.086596   
## Noyes                   0.080514  0.081245  0.081505  0.086206  0.086941   
## Jackson/State           0.079003  0.080189  0.086028  0.090937  0.087688   
## O'Hare Airport          0.066802  0.061858  0.076246  0.078602  0.090151   
## Medical Center          0.079218  0.080000  0.091860  0.085438  0.078145   
## Loyola                  0.076656  0.081891  0.086326  0.086079  0.075960   
## 
## month                          6         7         8         9        10  \
## stationname                                                                
## Madison/Wabash          0.000000  0.000000  0.000000  0.000000  0.000000   
## Oakton-Skokie           0.058680  0.057094  0.059329  0.062134  0.065861   
## Dempster-Skokie         0.060479  0.062203  0.064026  0.061321  0.065587   
## UIC-Halsted             0.055199  0.056054  0.074511  0.115300  0.120902   
## Addison-North Main      0.101552  0.109265  0.111560  0.104275  0.100128   
## Cermak-McCormick Place  0.090257  0.101195  0.088459  0.096139  0.118213   
## Linden                  0.100439  0.111700  0.107350  0.093748  0.097619   
## Fullerton               0.079906  0.072579  0.069316  0.096725  0.105022   
## California-Cermak       0.083142  0.083541  0.083823  0.125757  0.087759   
## Sox-35th-Dan Ryan       0.093067  0.100424  0.097194  0.094181  0.087019   
## Racine                  0.086149  0.069863  0.073575  0.089183  0.095642   
## Laramie                 0.082131  0.081475  0.080246  0.083775  0.087000   
## Harrison                0.077955  0.087137  0.071646  0.094071  0.105191   
## Central-Evanston        0.088274  0.088917  0.087022  0.103993  0.099941   
## Grand/State             0.091709  0.103065  0.096821  0.086325  0.090692   
## Noyes                   0.083811  0.085791  0.075373  0.083252  0.100941   
## Jackson/State           0.082838  0.083690  0.077902  0.092210  0.098377   
## O'Hare Airport          0.089674  0.095615  0.093757  0.091446  0.095852   
## Medical Center          0.081168  0.075943  0.077842  0.092896  0.098200   
## Loyola                  0.078668  0.079758  0.081569  0.097936  0.097516   
## 
## month                         11        12  Deviation  
## stationname                                            
## Madison/Wabash          0.000000  0.000547   0.533109  
## Oakton-Skokie           0.104118  0.091329   0.085198  
## Dempster-Skokie         0.100770  0.090949   0.077561  
## UIC-Halsted             0.103581  0.057152   0.076466  
## Addison-North Main      0.067443  0.049991   0.061739  
## Cermak-McCormick Place  0.091843  0.070068   0.054797  
## Linden                  0.074121  0.064920   0.047597  
## Fullerton               0.082711  0.053544   0.039950  
## California-Cermak       0.079249  0.073173   0.039019  
## Sox-35th-Dan Ryan       0.073028  0.063615   0.029337  
## Racine                  0.080634  0.072346   0.028373  
## Laramie                 0.077736  0.073780   0.027801  
## Harrison                0.087984  0.070757   0.025257  
## Central-Evanston        0.089313  0.071745   0.023600  
## Grand/State             0.078699  0.072990   0.023107  
## Noyes                   0.090357  0.064064   0.022523  
## Jackson/State           0.080978  0.060161   0.022092  
## O'Hare Airport          0.084282  0.075715   0.021967  
## Medical Center          0.085717  0.073572   0.021767  
## Loyola                  0.087161  0.070479   0.021431  
## -c:136: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
## 
##                  station_id  stationname   date  rides  convDate
## daytype weekday                                                 
## A       5             15120        15120  15120  15120     15120
## U       6             14976        14976  14976  14976     14976
## W       0             14112        14112  14112  14112     14112
##         1             14976        14976  14976  14976     14976
##         2             14976        14976  14976  14976     14976
##         3             14688        14688  14688  14688     14688
##         4             14688        14688  14688  14688     14688
##                  rides
## weekday               
## 0        605158.030612
## 1        631270.884615
## 2        630407.423077
## 3        636834.637255
## 4        633605.245098
## 5        381326.885714
## 6        290429.932692
## weekday                        0             1             2             3  \
## stationname                                                                  
## Addison-North Main   7894.020408   8665.096154   8586.673077   8392.549020   
## O'Hare Airport      11539.438776  10534.442308  10611.769231  11961.754902   
## Grand/State         11465.602041  11898.673077  12045.798077  12363.490196   
## Jackson/State       11220.979592  11798.278846  11690.461538  11683.372549   
## Madison/Wabash        597.040816    637.326923    632.923077    637.245098   
## Cermak-Chinatown     4068.969388   4158.807692   4150.019231   4249.205882   
## 
## weekday                        4       STD  
## stationname                                 
## Addison-North Main   9908.529412  0.034280  
## O'Hare Airport      12895.421569  0.034244  
## Grand/State         13908.558824  0.030359  
## Jackson/State        9801.088235  0.029682  
## Madison/Wabash        706.705882  0.024785  
## Cermak-Chinatown     4721.176471  0.024392

Average Daily Rides by Month - Chicago Train (CTA) 2015-2016:

Average Daily Rides by Month - 2015 vs 2016:

Average Daily Rides by Day Type - (CTA 2015-2016):

Average Daily Rides by Station - (CTA 2015-2016):

Average Daily Rides by Station and Day Type - (CTA 2015-2016):

Consistency of Average Rides by Station and Day Type - (CTA 2015-2016):

Seasonality of Average Rides by Station - (CTA 2015-2016):

Stations with Greatest Seasonality of Average Rides - (CTA 2015-2016):

Average Daily Rides by Day of Week - (CTA 2015-2016):

Consistency by Station of Average Daily Rides by Day of Week - (CTA 2015-2016):

Stations with Greatest Difference from System Average Rides by Day of Week - (CTA 2015-2016):


Additional Exploration - Chicago Crimes

Some additional experimentation with the Chicago Crime data, including:

  • Trend in total crime by day and month
  • Total crime by type and district
  • Clearance rate by crime type
  • Cleared crimes by crime type
  • Total crime by location description (cumulative share)
  • Location description mix by district

Example code includes:


myPath = "./PythonInputFiles/"



# Chicago Open Data crime database - filtered for 2015 only and districts 001, 016, and 019
# https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9
# File is in myPath + "Chicago_Crime_2015_001_016_019.csv"
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np


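# Read the file and keep only the columns of interest
rawCrime = pd.read_csv(myPath + "Chicago_Crime_2015_001_016_019.csv")
filtCrime = rawCrime[["Date", "Block", "Primary Type", "Description", "Location Description", "Arrest", "District", "Beat", "Ward", "Community Area"]]
# Parse the date portion (the text before the first space) of each "Date" string
# Note: assigning a column to this column subset triggers the SettingWithCopyWarning shown in the output
# Appending .copy() to the column selection above would avoid the warning
filtCrime["convDate"] = [datetime.strptime(x.split()[0], "%m/%d/%Y") for x in filtCrime["Date"]]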


# Total crime by day and month
dateCrime = filtCrime[["convDate", "Block"]].groupby("convDate").count()
dateCrime.plot()
plt.ylim([0, 10 * round(max(dateCrime["Block"]) / 10, 0) + 10])
plt.xlabel("")
plt.title("Chicago Crimes by Day in 2015 \n(Districts 001, 016, 019)")
# plt.show()
plt.savefig("_dummyPy104.png", bbox_inches="tight")
plt.clf()


dateCrime.resample("M").sum().plot(kind="bar")
plt.xticks(np.arange(12), ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], rotation=0)
plt.xlabel("")
plt.title("Chicago Crimes by Month in 2015 \n(Districts 001, 016, 019)")
# plt.show()
plt.savefig("_dummyPy105.png", bbox_inches="tight")
plt.clf()


# Total crime by type and District
typeCrime = filtCrime.pivot_table(index="Primary Type", columns="District", values="Block", aggfunc=len).fillna(0)
typeCrime["Total"] = typeCrime.sum(axis=1)

print(typeCrime.sort_values("Total", ascending=False).iloc[0:20, :])


# Clearance Rate by Crime Type
arrestCrime = filtCrime[["Primary Type", "Arrest"]].pivot_table(index="Primary Type", columns="Arrest", aggfunc=len).fillna(0)
arrestCrime["Total"] = arrestCrime.sum(axis=1)
arrestCrime["Clear"] = arrestCrime[True] / arrestCrime["Total"]
arrestCrime = arrestCrime.sort_values("Total", ascending=False)

print(arrestCrime.iloc[0:20, :])
nPlot = 12

fig, ax1 = plt.subplots()

(arrestCrime["Total"][0:nPlot]/1000).plot(kind="bar")
plt.title("Chicago Crimes and Clearance Rate in 2015 \n(Districts 001, 016, 019)")
plt.xlabel("Crime Type")
xTickNewLine = [x.capitalize().replace(" ", "\n") for x in arrestCrime.index]
plt.xticks(np.arange(nPlot), xTickNewLine[0:nPlot], fontsize=9, rotation=90)

ax1.set_ylabel("Total Crimes (000)", color="b")
ax1.tick_params("y", colors="b")

ax2 = plt.twinx()
ax2.plot(list(arrestCrime["Clear"][0:nPlot]), "r-")
ax2.set_ylabel("Clearance Rate", color="r")
ax2.tick_params("y", colors="r")

plt.tight_layout()
# plt.show()
plt.savefig("_dummyPy106.png", bbox_inches="tight")
plt.clf()


# Chicago Crimes Cleared
nPlot = 12
(arrestCrime.sort_values(True, ascending=False)[True][0:nPlot]/1000).plot(kind="bar")
plt.title("Chicago Crimes Cleared in 2015 \n(Districts 001, 016, 019)")
plt.xlabel("Crime Type")
xTickNewLine = [x.capitalize().replace(" ", "\n") for x in arrestCrime.sort_values(True, ascending=False).index]
plt.xticks(np.arange(nPlot), xTickNewLine[0:nPlot], fontsize=9, rotation=90)
plt.ylabel("Total Crimes Cleared (000)")
plt.tight_layout()
# plt.show()
plt.savefig("_dummyPy107.png", bbox_inches="tight")
plt.clf()


# Total crime by location description
locCrime = filtCrime["Location Description"].value_counts()
print(locCrime[0:20])
print(locCrime[0:20].cumsum() / sum(locCrime))
nPlot=15
(locCrime[0:nPlot].cumsum() / sum(locCrime)).plot(kind="bar")
plt.ylim([0, 1])
plt.ylabel("Cumulative proportion of crimes")
plt.xlabel("")
plt.title("ECDF for crime locations - Chicago 2015\n(Districts 001, 016, 019)")
xTickNewLine = [x[0:20].capitalize().replace(" ", "\n") for x in locCrime.index]
plt.xticks(np.arange(nPlot), xTickNewLine[0:nPlot], fontsize=8, rotation=90)
plt.tight_layout()
# plt.show()
plt.savefig("_dummyPy108.png", bbox_inches="tight")
plt.clf()


# Total crime by location description and district
locDistCrime = filtCrime[["Location Description", "District"]].pivot_table(index="Location Description", columns="District", aggfunc=len).fillna(0)
locDistCrime["Total"] = locDistCrime.sum(axis=1)
locDistCrime = locDistCrime.sort_values("Total", ascending=False)

nPlot = 15

fig, ax1 = plt.subplots()

(locDistCrime["Total"][0:nPlot]).plot(kind="bar", color="b", alpha=0.5)
plt.title("Chicago Crime Locations by District in 2015 \n(Districts 001, 016, 019)")
plt.xlabel("Location Description")
xTickNewLine = [x[0:20].capitalize().replace(" ", "\n") for x in locDistCrime.index]
plt.xticks(np.arange(nPlot), xTickNewLine[0:nPlot], fontsize=8, rotation=90)

ax1.set_ylabel("Total Crimes", color="b")
ax1.tick_params("y", colors="b")

ax2 = plt.twinx()
ax2.plot(list((locDistCrime[1]/locDistCrime["Total"])[0:nPlot]), "r-")
ax2.plot(list((locDistCrime[16]/locDistCrime["Total"])[0:nPlot]), "g-")
ax2.plot(list((locDistCrime[19]/locDistCrime["Total"])[0:nPlot]), "y-")
ax2.set_ylim([0, 1])
ax2.set_ylabel("Proportion by District")

plt.legend(["001", "016", "019"])
plt.tight_layout()
# plt.show()
plt.savefig("_dummyPy109.png", bbox_inches="tight")
plt.clf()
## -c:17: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
## District                               1      16      19    Total
## Primary Type                                                     
## THEFT                             5651.0  2192.0  4177.0  12020.0
## BATTERY                           1313.0  1406.0  1600.0   4319.0
## DECEPTIVE PRACTICE                1431.0   771.0  1069.0   3271.0
## CRIMINAL DAMAGE                    732.0  1201.0  1169.0   3102.0
## OTHER OFFENSE                      471.0   842.0   435.0   1748.0
## ASSAULT                            530.0   528.0   490.0   1548.0
## CRIMINAL TRESPASS                  596.0   402.0   398.0   1396.0
## BURGLARY                           176.0   597.0   597.0   1370.0
## NARCOTICS                          219.0   528.0   404.0   1151.0
## MOTOR VEHICLE THEFT                196.0   344.0   416.0    956.0
## ROBBERY                            319.0   159.0   371.0    849.0
## PUBLIC PEACE VIOLATION              99.0    92.0    76.0    267.0
## OFFENSE INVOLVING CHILDREN          47.0    88.0    64.0    199.0
## CRIM SEXUAL ASSAULT                 28.0    40.0    86.0    154.0
## SEX OFFENSE                         49.0    44.0    49.0    142.0
## WEAPONS VIOLATION                   23.0    48.0    33.0    104.0
## INTERFERENCE WITH PUBLIC OFFICER    14.0    20.0    26.0     60.0
## LIQUOR LAW VIOLATION                12.0    12.0    26.0     50.0
## PROSTITUTION                        24.0     7.0     3.0     34.0
## ARSON                                7.0     8.0    19.0     34.0
## Arrest                              False    True    Total     Clear
## Primary Type                                                        
## THEFT                             10424.0  1596.0  12020.0  0.132779
## BATTERY                            3119.0  1200.0   4319.0  0.277842
## DECEPTIVE PRACTICE                 3070.0   201.0   3271.0  0.061449
## CRIMINAL DAMAGE                    2909.0   193.0   3102.0  0.062218
## OTHER OFFENSE                      1551.0   197.0   1748.0  0.112700
## ASSAULT                            1152.0   396.0   1548.0  0.255814
## CRIMINAL TRESPASS                   339.0  1057.0   1396.0  0.757163
## BURGLARY                           1281.0    89.0   1370.0  0.064964
## NARCOTICS                             2.0  1149.0   1151.0  0.998262
## MOTOR VEHICLE THEFT                 892.0    64.0    956.0  0.066946
## ROBBERY                             737.0   112.0    849.0  0.131920
## PUBLIC PEACE VIOLATION              115.0   152.0    267.0  0.569288
## OFFENSE INVOLVING CHILDREN          171.0    28.0    199.0  0.140704
## CRIM SEXUAL ASSAULT                 142.0    12.0    154.0  0.077922
## SEX OFFENSE                          99.0    43.0    142.0  0.302817
## WEAPONS VIOLATION                    22.0    82.0    104.0  0.788462
## INTERFERENCE WITH PUBLIC OFFICER      3.0    57.0     60.0  0.950000
## LIQUOR LAW VIOLATION                  0.0    50.0     50.0  1.000000
## PROSTITUTION                          0.0    34.0     34.0  1.000000
## ARSON                                25.0     9.0     34.0  0.264706
## STREET                            6001
## RESIDENCE                         3721
## APARTMENT                         2391
## SIDEWALK                          2336
## OTHER                             2323
## RESTAURANT                        1624
## PARKING LOT/GARAGE(NON.RESID.)    1220
## DEPARTMENT STORE                  1200
## SMALL RETAIL STORE                1158
## RESIDENCE-GARAGE                   720
## GROCERY FOOD STORE                 570
## RESIDENCE PORCH/HALLWAY            531
## PARK PROPERTY                      525
## HOTEL/MOTEL                        482
## ALLEY                              475
## BAR OR TAVERN                      474
## RESIDENTIAL YARD (FRONT/BACK)      403
## VEHICLE NON-COMMERCIAL             368
## COMMERCIAL / BUSINESS OFFICE       368
## SCHOOL, PUBLIC, BUILDING           352
## Name: Location Description, dtype: int64
## STREET                            0.182829
## RESIDENCE                         0.296195
## APARTMENT                         0.369040
## SIDEWALK                          0.440210
## OTHER                             0.510983
## RESTAURANT                        0.560461
## PARKING LOT/GARAGE(NON.RESID.)    0.597630
## DEPARTMENT STORE                  0.634189
## SMALL RETAIL STORE                0.669470
## RESIDENCE-GARAGE                  0.691405
## GROCERY FOOD STORE                0.708771
## RESIDENCE PORCH/HALLWAY           0.724949
## PARK PROPERTY                     0.740944
## HOTEL/MOTEL                       0.755629
## ALLEY                             0.770100
## BAR OR TAVERN                     0.784541
## RESIDENTIAL YARD (FRONT/BACK)     0.796819
## VEHICLE NON-COMMERCIAL            0.808031
## COMMERCIAL / BUSINESS OFFICE      0.819243
## SCHOOL, PUBLIC, BUILDING          0.829967
## Name: Location Description, dtype: float64
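
The SettingWithCopyWarning at the top of the output above comes from adding convDate to a column subset of rawCrime. One way to avoid it is to take an explicit .copy() of the subset before assigning. The sketch below is a minimal illustration under stated assumptions: rawToy and filtToy are hypothetical names, and the rows are invented rather than taken from the Chicago file. It also uses pd.to_datetime in place of the list comprehension, which is a swapped-in alternative rather than the approach used above.

```python
import warnings

import pandas as pd

# Hypothetical rows standing in for the crime file (invented for illustration)
rawToy = pd.DataFrame({
    "Date": ["01/15/2015 10:30:00 PM", "02/03/2015 08:00:00 AM"],
    "Block": ["001XX N STATE ST", "002XX W LAKE ST"],
    "Arrest": [True, False],
})

# .copy() makes the column subset an independent DataFrame
filtToy = rawToy[["Date", "Block", "Arrest"]].copy()

# Record any warnings raised while adding the new column
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Keep the date portion (text before the first space) and parse it
    filtToy["convDate"] = pd.to_datetime(
        filtToy["Date"].str.split().str[0], format="%m/%d/%Y"
    )

# Count how many of the recorded warnings are the copy warning
copyWarnings = [w for w in caught if w.category.__name__ == "SettingWithCopyWarning"]
print(len(copyWarnings))
print(filtToy["convDate"].dt.month.tolist())
```

Because filtToy is a separate object, the new column is not added to rawToy.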

Crimes by Day (Chicago 2015 - Districts 001, 016, 019):

Crimes by Month (Chicago 2015 - Districts 001, 016, 019):

% Crimes Cleared by Crime Type (Chicago 2015 - Districts 001, 016, 019):

Total # Crimes Cleared by Crime Type (Chicago 2015 - Districts 001, 016, 019):
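
The clearance rate (the Clear column of arrestCrime) is the share of records in each crime type where Arrest is True. As an alternative sketch, not the pivot-table approach used in the code above, the same quantity can be computed as the mean of the Boolean Arrest column within each group. The names toyCrime and clearToy are hypothetical and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical records (invented for illustration, not from the Chicago data)
toyCrime = pd.DataFrame({
    "Primary Type": ["THEFT", "THEFT", "THEFT", "THEFT",
                     "NARCOTICS", "NARCOTICS", "BATTERY", "BATTERY"],
    "Arrest": [True, False, False, False, True, True, True, False],
})

# size counts records, sum counts True values, mean is the share of True values
clearToy = toyCrime.groupby("Primary Type")["Arrest"].agg(
    Total="size", Cleared="sum", Clear="mean"
)

# Order by total volume, as in the arrestCrime table
clearToy = clearToy.sort_values("Total", ascending=False)
print(clearToy)
```

This avoids the separate False and True columns, at the cost of not showing the uncleared count directly (it is Total minus Cleared).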

ECDF for Location Descriptions (Chicago 2015 - Districts 001, 016, 019):

Location Descriptions by District (Chicago 2015 - Districts 001, 016, 019):
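
The cumulative share by location description (computed above as locCrime.cumsum() / sum(locCrime)) can also be obtained in one chain by asking value_counts for proportions. A minimal sketch, assuming a made-up Series (toyLoc and cumShareToy are hypothetical names, and the values are invented for illustration):

```python
import pandas as pd

# Hypothetical location descriptions (invented for illustration)
toyLoc = pd.Series(
    ["STREET", "STREET", "STREET", "RESIDENCE",
     "RESIDENCE", "ALLEY", "STREET", "SIDEWALK"],
    name="Location Description",
)

# value_counts sorts from most to least frequent
# normalize=True returns proportions instead of counts
cumShareToy = toyLoc.value_counts(normalize=True).cumsum()
print(cumShareToy)
```

The last value is always 1, so reading down the result shows how many location descriptions are needed to cover a given share of records.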